Various Corrdiff optimizations for drastic increase of training efficiency #809


Merged
merged 21 commits into NVIDIA:main on May 1, 2025

Conversation

@LostnEkko (Contributor) commented Mar 14, 2025

Various Corrdiff optimizations for drastic increase of training efficiency

Description

  • Updated the CorrDiff training code to support multiple patch iterations, amortizing the regression cost, and to support the use of torch.compile
  • Refactored modulus/models/diffusion/layers.py to optimize the data type casting workflow, avoiding unnecessary casting under autocast mode (see the sketch after this list)
  • Refactored Conv2d to enable fusion of conv2d with the bias addition
  • Refactored GroupNorm, UNetBlock, SongUNet, and SongUNetPosEmbd to support Apex GroupNorm, fusion of the activation with GroupNorm, and the AMP workflow
  • Updated SongUNetPosEmbd to avoid an unnecessary host-to-device memcpy of pos_embd
  • Updated from_checkpoint to accommodate the use of Apex GroupNorm
  • Refactored the CorrDiff NVTX annotation workflow to be configurable
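
As a rough illustration of the casting change in the second bullet, here is a minimal sketch with a hypothetical AmpAwareLinear layer; the actual refactor lives in modulus/models/diffusion/layers.py and covers more layer types:

```python
import torch


class AmpAwareLinear(torch.nn.Linear):
    """Hypothetical layer illustrating autocast-aware dtype handling.

    Under torch.autocast, PyTorch already casts the inputs and weights of
    covered ops to the autocast dtype, so forcing an explicit cast here would
    only add redundant dtype conversions. The explicit cast is therefore
    applied only when autocast is inactive.
    """

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not torch.is_autocast_enabled():
            # Outside autocast: keep the original behavior of matching the weight dtype.
            x = x.to(self.weight.dtype)
        return super().forward(x)
```

The same pattern (checking for autocast before casting) avoids the extra casts that would otherwise be inserted in every forward pass under AMP.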

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.

Dependencies

@CharlelieLrt self-requested a review March 14, 2025 18:07
@CharlelieLrt added the labels enhancement (New feature or request), 3 - Ready for Review (Ready for review by team), 5 - Merge After Dependencies (Depends on another PR: do not merge out of order), and Earth-2 on Mar 14, 2025
@mnabian (Collaborator) commented Mar 14, 2025

/blossom-ci

@CharlelieLrt mentioned this pull request Mar 18, 2025
@simonbyrne (Contributor) commented

What's the status of this? I would love to make use of these in ReGen.

@CharlelieLrt (Collaborator) commented Apr 2, 2025

@simonbyrne it's currently blocked by #790 and under review, but it should be merged in the coming days.
AFAIK the current implementation of ReGen does not support these optimizations, and some work will be required to enable them.

@CharlelieLrt (Collaborator) commented

/blossom-ci

@CharlelieLrt (Collaborator) commented

/blossom-ci

@CharlelieLrt (Collaborator) commented

/blossom-ci

@loliverhennigh (Collaborator) commented

Hey @jialusui1102, @CharlelieLrt and I talked about the backward compatibility issues raised by this PR and #790. For now we can get this in, but I will fix the backward compatibility issues ASAP afterwards. @CharlelieLrt and I discussed a solution that seems to solve all the issues. I'll need you, @jialusui1102, to take a look at that PR when the time comes, though, to make sure it works with the CorrDiff model.

@jialusui1102 (Collaborator) commented

Hey @loliverhennigh, thanks for letting me know and for merging my PR, and thanks @CharlelieLrt for coordinating. Let me know when the PR is ready and I will test the CorrDiff checkpoint to make sure everything works!

@CharlelieLrt self-requested a review May 1, 2025 19:20
@CharlelieLrt (Collaborator) left a comment

LGTM!

@CharlelieLrt (Collaborator) commented

/blossom-ci

@CharlelieLrt (Collaborator) commented

/blossom-ci

@CharlelieLrt merged commit 1a6288d into NVIDIA:main on May 1, 2025
1 check passed
@CharlelieLrt (Collaborator) commented May 3, 2025

Leaving some comments about a few things that were left out and could be addressed by future PRs:

  1. The optimizations are not enabled for the SongUNetPosLtEmbd architecture. Ideally we would merge (subclass) it with the other SongUNets so that the optimizations do not need to be reimplemented.
  2. Using patch-wise gradient accumulation still requires many manual changes to the training loop. It would be good to automate this with something like a wrapper, e.g. loss_fn = patch_wise_gradient_accumulation(loss_fn) (see the sketch after this list).
  3. The Conv2d layers need restructuring: the kernel=0 case is an edge case (actually used only once in SongUNet) that complicates the implementation of these optimizations. Ideally the kernel=0 case should be handled by a separate class.
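
A minimal sketch of the wrapper idea in item 2, assuming a hypothetical loss_fn(model, batch, patch_idx) signature and a fixed patch count; it only shows how the per-patch loop and backward calls could be hidden behind a decorator, not the actual CorrDiff training API:

```python
from functools import wraps


def patch_wise_gradient_accumulation(loss_fn, num_patches: int):
    """Hypothetical wrapper: run `loss_fn` once per patch and accumulate gradients.

    Assumes `loss_fn(model, batch, patch_idx)` returns a scalar loss tensor for a
    single patch. Each per-patch loss is scaled by 1/num_patches and backpropagated
    immediately, so gradients accumulate across patches without keeping every
    autograd graph alive at once. The detached mean loss is returned for logging.
    """

    @wraps(loss_fn)
    def wrapped(model, batch):
        mean_loss = 0.0
        for patch_idx in range(num_patches):
            loss = loss_fn(model, batch, patch_idx) / num_patches
            loss.backward()
            mean_loss += loss.detach()
        return mean_loss

    return wrapped
```

optimizer.zero_grad() before the wrapped call and optimizer.step() afterwards would stay in the training loop, exactly as with ordinary gradient accumulation.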

@shrek mentioned this pull request May 5, 2025
daviddpruitt added a commit to AtmosSci-DLESM/modulus-uw that referenced this pull request May 7, 2025
* Add CELU activation function (NVIDIA#851)

* refactor: updating naming of a few files (modulus -> physicsnemo) (NVIDIA#850)

Co-authored-by: Oliver Hennigh <[email protected]>

* Various Corrdiff optimizations for drastic increase of training efficiency (NVIDIA#809)

* mult-gpu training supported corrdiff optimization

* enable mixed precision for val

* clean codebase for opt

* add amp_mode aware model architecture

* add None checking for params

* revise datatype casting schema

* Add test cases for corrdiff optimizations

Signed-off-by: Neal Pan <[email protected]>

* revised from_checkpoint, update tests and CHANGELOG

Signed-off-by: jialusui1102 <[email protected]>

* Lint and format code properly

Signed-off-by: Neal Pan <[email protected]>

* add multi-gpu optimization

* rebase changes and update tests and configs

Signed-off-by: jialusui1102 <[email protected]>

* merge ResidualLoss and refactored layer and Unet init based on PR review

Signed-off-by: jialusui1102 <[email protected]>

* Update layers.py with robust apex import

* address incompatibility between dynamo and patching, retain same optimization perf w torch.compile

Signed-off-by: jialusui1102 <[email protected]>

* update tests

Signed-off-by: jialusui1102 <[email protected]>

* update changelog

Signed-off-by: jialusui1102 <[email protected]>

* initialize global_index directly on device

Signed-off-by: jialusui1102 <[email protected]>

* formatting

Signed-off-by: jialusui1102 <[email protected]>

---------

Signed-off-by: Neal Pan <[email protected]>
Signed-off-by: jialusui1102 <[email protected]>
Co-authored-by: Alicia Sui <[email protected]>
Co-authored-by: jialusui1102 <[email protected]>
Co-authored-by: Charlelie Laurent <[email protected]>

* Catch improper use of patch gradient accumulation (NVIDIA#868)

* Update train.py to catch improper use of path grad acc

* Update train.py

* Update train.py

* Fixes compile of regression model in train.py

* Removed unused imports

Signed-off-by: Charlelie Laurent <[email protected]>

* Changed grad patch accumulation logic

Signed-off-by: Charlelie Laurent <[email protected]>

---------

Signed-off-by: Charlelie Laurent <[email protected]>

---------

Signed-off-by: Neal Pan <[email protected]>
Signed-off-by: jialusui1102 <[email protected]>
Signed-off-by: Charlelie Laurent <[email protected]>
Co-authored-by: Yang-yang Tan <[email protected]>
Co-authored-by: Carmelo Gonzales <[email protected]>
Co-authored-by: Oliver Hennigh <[email protected]>
Co-authored-by: nekobytz <[email protected]>
Co-authored-by: Alicia Sui <[email protected]>
Co-authored-by: jialusui1102 <[email protected]>
Co-authored-by: Charlelie Laurent <[email protected]>
coreyjadams added a commit that referenced this pull request May 12, 2025
* Stormcast Customization (#799)

* Add common dataloader interface

* Training script runs with refactored dataloader

* More trainer refactoring

* Refactor inference

* Add support for gradient accumulation

* Add support for AMP 16-bit training

* Align training parameters with StormCast paper

* Add comments to inference.py

* Add lite configs

* Add lite configs

* Small bug fixes

* Add support for compiling model

* Validation fixes

* Refactor checkpoint loading at startup

* Support wandb offline mode

* Fix regression_model_forward

* Update CHANGELOG.md

* Update README.md for DOMINO (#846)

* Update ERA5 download example (#845)

* Update era5 download example

* Update changelog

* Pytest speedup (#849)

* add code to measure time spent in pytest

* speed up datapipe tests

* fix cleanup of dist vars (was causing slowdown in test_capture.py)

* speed up model tests

* bring back some parameterizations, reduced cpu tests

* Add CELU activation function (#851)

* refactor: updating naming of a few files (modulus -> physicsnemo) (#850)

Co-authored-by: Oliver Hennigh <[email protected]>

* Various Corrdiff optimizations for drastic increase of training efficiency (#809)

* mult-gpu training supported corrdiff optimization

* enable mixed precision for val

* clean codebase for opt

* add amp_mode aware model architecture

* add None checking for params

* revise datatype casting schema

* Add test cases for corrdiff optimizations

Signed-off-by: Neal Pan <[email protected]>

* revised from_checkpoint, update tests and CHANGELOG

Signed-off-by: jialusui1102 <[email protected]>

* Lint and format code properly

Signed-off-by: Neal Pan <[email protected]>

* add multi-gpu optimization

* rebase changes and update tests and configs

Signed-off-by: jialusui1102 <[email protected]>

* merge ResidualLoss and refactored layer and Unet init based on PR review

Signed-off-by: jialusui1102 <[email protected]>

* Update layers.py with robust apex import

* address incompatibility between dynamo and patching, retain same optimization perf w torch.compile

Signed-off-by: jialusui1102 <[email protected]>

* update tests

Signed-off-by: jialusui1102 <[email protected]>

* update changelog

Signed-off-by: jialusui1102 <[email protected]>

* initialize global_index directly on device

Signed-off-by: jialusui1102 <[email protected]>

* formatting

Signed-off-by: jialusui1102 <[email protected]>

---------

Signed-off-by: Neal Pan <[email protected]>
Signed-off-by: jialusui1102 <[email protected]>
Co-authored-by: Alicia Sui <[email protected]>
Co-authored-by: jialusui1102 <[email protected]>
Co-authored-by: Charlelie Laurent <[email protected]>

* Catch improper use of patch gradient accumulation (#868)

* Update train.py to catch improper use of path grad acc

* Update train.py

* Update train.py

* Fixes compile of regression model in train.py

* Removed unused imports

Signed-off-by: Charlelie Laurent <[email protected]>

* Changed grad patch accumulation logic

Signed-off-by: Charlelie Laurent <[email protected]>

---------

Signed-off-by: Charlelie Laurent <[email protected]>

* This commit fixes two minor bugs in the physicsnemo profiling tools (#862)

- If line_profiler isn't available, it sometimes broke due to a missing check.
- If the torch profiler is used but the code exits before profiling, it will crash.

* Adding abokov-nv to authorized users to trigger blossom-ci.yml (#867)

* Fixes indexing issues and CPU memory consumption in dataloader (#879)

* Add workflow to automatically sync changes from nvidia/modulus into main branch

* add WeightedOceanMSE to criterion

* add optional gaussian noise to inputs and coupled variables during training - should improve coupled stability

* add random seed - still need to test

* remove datatransformer code - shouldn't be part of this PR

* move logging

* Removed blossom-ci workflow from modulus-uw fork, updated automatic sync

* Fix the training and inference problem in nvidia modulus

Signed-off-by: root <[email protected]>

* Fix indexing in constant coupler

* add 'Multi_SymmetricConvNeXtBlock'

* Repalce 'n_layers' with 'n_conv_blocks' for clarity

* Fix indexing in constant coupler

Signed-off-by: root <[email protected]>

* Fix indexing in constant coupler

Signed-off-by: root <[email protected]>

* change back to 'n_layers' to match the old models

* enforce precedence of upstream modulus changes when auto syncing.

* set scaling for mean: 0, std: 1 where no change is needed

* fix memory leak in coupled timeseries

* Add workflow to automatically sync changes from nvidia/modulus into main branch

* Fix the training and inference problem in nvidia modulus

Signed-off-by: root <[email protected]>

* Fix indexing in constant coupler

Signed-off-by: root <[email protected]>

* Fix indexing in constant coupler

Signed-off-by: root <[email protected]>

* Removed blossom-ci workflow from modulus-uw fork, updated automatic sync

* enforce precedence of upstream modulus changes when auto syncing.

* set scaling for mean: 0, std: 1 where no change is needed

* add 'Multi_SymmetricConvNeXtBlock'

* Repalce 'n_layers' with 'n_conv_blocks' for clarity

* change back to 'n_layers' to match the old models

* fix memory leak in coupled timeseries

* add coupler fixes, var and time selection

* Fix for ordering on coupler

* batch size fix in coupler

* broken workflow cleanup

* cleanup for upstream merge (#20)

---------

Signed-off-by: root <[email protected]>
Co-authored-by: nathanielcresswellclay <[email protected]>
Co-authored-by: zacespinosa <[email protected]>
Co-authored-by: nathanielcresswellclay <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Bowen Liu <[email protected]>
Co-authored-by: Bowen Liu <[email protected]>
Co-authored-by: Yair Cohen <[email protected]>

* CorrDiff: inference bugfixes, cleanup, and documentation improvements (#882)

* Disabled cuda profiler for cpu runs

Signed-off-by: Charlelie Laurent <[email protected]>

* Added __init__ to avoid dataset module collision

Signed-off-by: Charlelie Laurent <[email protected]>

* Disabled torch emit_nvtx for cpu runs. Renamed 'test_train_split' to 'validation'

Signed-off-by: Charlelie Laurent <[email protected]>

* Fixed typo in 'set_patch_num'

Signed-off-by: Charlelie Laurent <[email protected]>

* More profiling stats disabled for CPU runs

Signed-off-by: Charlelie Laurent <[email protected]>

* Removed duplicate code in ResidualLoss

Signed-off-by: Charlelie Laurent <[email protected]>

* Disabled AMP in inference

Signed-off-by: Charlelie Laurent <[email protected]>

* Fixed f-strings in train script

Signed-off-by: Charlelie Laurent <[email protected]>

* Added details about validation and early-stopping in readme

Signed-off-by: Charlelie Laurent <[email protected]>

---------

Signed-off-by: Charlelie Laurent <[email protected]>

* Stormcast customization conditions (#880)

* Add configurable model inputs

* Ensure pure diffusion works

* Update docstrings, error handling

* Update StormCast README.md and docstrings

* Minor revisions to StormCast README.md

* Fix typo in StormCast README.md

* Making the unit tests' nfs-data-dir configurable (#866)

* Update msc test with import_or_skip decorator (#884)

* update msc test with import_or_skip decorator

* linting

* update package name

---------

Co-authored-by: root <[email protected]>

---------

Signed-off-by: Neal Pan <[email protected]>
Signed-off-by: jialusui1102 <[email protected]>
Signed-off-by: Charlelie Laurent <[email protected]>
Signed-off-by: root <[email protected]>
Co-authored-by: Jussi Leinonen <[email protected]>
Co-authored-by: Kaustubh Tangsali <[email protected]>
Co-authored-by: Peter Harrington <[email protected]>
Co-authored-by: Yang-yang Tan <[email protected]>
Co-authored-by: Carmelo Gonzales <[email protected]>
Co-authored-by: Oliver Hennigh <[email protected]>
Co-authored-by: nekobytz <[email protected]>
Co-authored-by: Alicia Sui <[email protected]>
Co-authored-by: jialusui1102 <[email protected]>
Co-authored-by: Charlelie Laurent <[email protected]>
Co-authored-by: abokov-nv <[email protected]>
Co-authored-by: David Pruitt <[email protected]>
Co-authored-by: nathanielcresswellclay <[email protected]>
Co-authored-by: zacespinosa <[email protected]>
Co-authored-by: nathanielcresswellclay <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Bowen Liu <[email protected]>
Co-authored-by: Bowen Liu <[email protected]>
Co-authored-by: Yair Cohen <[email protected]>
Co-authored-by: root <[email protected]>
coreyjadams added a commit that referenced this pull request May 16, 2025
* adding multiscale feature to model

* bug fixes in model

* removing non-dim scaling from domino datapipe

* adding tests for multiscale training and inference script

* minor fix for surface training blow up

* fixing bug in inference

* surface volume radii

* refactoring

* hyper tuning

* fixing duplication in model

* model improvement

* adding resampling and geo encoding type

* Caching Mechanism For DoMINO Training (#805)

* Profiling code

* Fix typo

* Factor out compute_scaling_factors to re-use in cache script

* Add caching capabilities

* Domino datapipe

* Missed removal

* Hotfixes

* Modify cached config

* Prune unused imports, add requirements and formatting from pre-commit hook

* Simplify domino dataset handling in presence of cache

* Renaming

* Add the Datacenter use case (#783)

* Uploaded core files for the datacenter case!

* add licensing text

* add readme

* update datapipe, make caching optional

* update Readme

* update changelog

* update requirements.txt

---------

Co-authored-by: Kaustubh Tangsali <[email protected]>
Co-authored-by: root <[email protected]>

* Update README.md (#769)

Learnning -> Learning

Co-authored-by: Nicholas Geneva <[email protected]>

* Update README.md (#780)

Added link to preprint paper for Domino and XAeronet

Co-authored-by: Nicholas Geneva <[email protected]>

* Merge dlwp-healpix updates from modulus-uw (#785)

* Add workflow to automatically sync changes from nvidia/modulus into main branch

* add WeightedOceanMSE to criterion

* add optional gaussian noise to inputs and coupled variables during training - should improve coupled stability

* add random seed - still need to test

* remove datatransformer code - shouldn't be part of this PR

* move logging

* Removed blossom-ci workflow from modulus-uw fork, updated automatic sync

* Fix the training and inference problem in nvidia modulus

Signed-off-by: root <[email protected]>

* Fix indexing in constant coupler

* add 'Multi_SymmetricConvNeXtBlock'

* Repalce 'n_layers' with 'n_conv_blocks' for clarity

* Fix indexing in constant coupler

Signed-off-by: root <[email protected]>

* Fix indexing in constant coupler

Signed-off-by: root <[email protected]>

* change back to 'n_layers' to match the old models

* enforce precedence of upstream modulus changes when auto syncing.

* set scaling for mean: 0, std: 1 where no change is needed

* merge fixes and doc updates

* Fix on batch size check, logging cleanup

* test and lint fixes

* update coupler doc

* add test for missing fail condition on dlwp coupler

* fix error message for coupler

* add missing tests

* add missing tests

* adding tests for dlwp-healpix

* setup for upstream merge

* add test for multi symmetric conv block

* add missing tests, remove unreachable code

* update tests, remove unreachable code

* add additional tests for dlwp healpix couplers

* add tests and docs, cleanup code

* switch to import_or_fail decorator

* update to import_or_fail idiom

* add test for WeightedOceanMSE

---------

Signed-off-by: root <[email protected]>
Co-authored-by: nathanielcresswellclay <[email protected]>
Co-authored-by: zacespinosa <[email protected]>
Co-authored-by: nathanielcresswellclay <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Bowen Liu <[email protected]>
Co-authored-by: Bowen Liu <[email protected]>
Co-authored-by: Yair Cohen <[email protected]>

* Update Dockerfile (#791)

* Add random walk noise and kinematic mask (#786)

* Move to experiment-based Hydra config. Refactor logging.

* Update README and configs.

* Delete old configs.

* Revert "Delete old configs."

This reverts commit 5c13a1f.

* Refactor LagrangianDataset to support different noise schedules.

* Add random walk noise support to LagrangianDataset.

* Address review comments.

* Merge branch 'main' into lagrangian-mgn

* Revert "Merge branch 'main' into lagrangian-mgn"

This reverts commit 3b4f41d.

* Update CHANGELOG

* Add kinematic particles mask.

* Remove unused code.

* Fix unit test, update configs, inference script cleanup.

* Address review feedback.

* Update docstrings.

* Adds new Modulus devs to /blossom-ci authorized users (#792)

* Adds Peter S to Blossom-CI auth user list

* Adds a few other recently-added team members ot the CI list

* Dockerfile changes to handle onnxruntime dependency (#793)

* Update Dockerfile

* update onnx installation, and update tests

* update test onnx utils

* Fix NCCL_ASYNC_ERROR_HANDLING deprecation warning (#711)

* Fix NCCL_ASYNC_ERROR_HANDLING deprecation warning

It looks like the patch from pytorch/pytorch#114077 landed in torch 2.2.0.

Fixes #568.

* Update CHANGELOG.md

* Profiling (#787)

* Stashing profiling work

* Torch profile works but is very slow.  line profiler not functional at this time

* Enablement of profiling tool with pytorch profiler, as a context manager.  Still several TBD Objects but this implementation will capture a torch profile.

* Moving profiling tools into a directory to make separate tools more clearly separated as well as enable easier extensions.

* Profiling tools work with torch profiler and line_profiler.  nsys has a crash that I haven't resolved yet.

* Fix line profiling construction

* Begin instrumenting figconvnet and adding tutorials on modulus profiling tools

* Remove annotations and force all annotations to conform to nvtx.  Simpler, for now, and the most (only?) useful annotation tool

* Updating profiling tutorial

* Minor updates to profiling interfaces

* only adding some profiling hooks to figconvnet

* Add profiling hooks to mesh graph net.

* Set TELayerNorm to default layer norm in MeshGraphNet

* Nearly finished profiling tutorial and tooling example.  Just need to add images.

* Final (first) draft of the profiling tutorial and clean up profiler code slightly.  Ready for draft PR

* Add tests to the profiler tools to check functionality.  Thanks Cursor!

Some minor updtes to the tools themselves to accomodate instance clearing and refreshing.

* Update changelog for profiling tools

* Update profiler files to (hopefully) pass CI checks

* Remove profiling parts from capture.py for later integration

* Update __init__.py

Remove nvtx wrapper

* Add extra line to make linting happy...

* When cuda is not available (mostly CI), emit a warning and switch to native layer norm.

* Make the default as LayerNorm so tests will pass.  Needs more care in the test, I think, about TELayerNorm

* Very minor fixes per review

* Resolve most comments from PR review.  One to go (profiler state to become a literal)

* Change profiler state tracker to a single state with an enum type.

* Two changes made here:
- the exit stack moves from a class variable to an instance variable
- The double-check locking mechanism in the registry becomes a single lock and check.

* Make sure the exit stack init is actually in __init__ and not initialize()

* Enable Domain Parallelism with ShardTensor (#784)

* Enable mesh-based parallelism as the configuration backend, even for simple DDP sharding

* Fix small typo in docstring

* Remove  unnecessary  functions with new interface

* Adding first implementation of ShardTensor prototype.  Still several pieces are WIP but this has basic functionality supported for creation and forward usage.

* Working implementation of ShardTensor, though still somewhate incomplete.

* Adding work-in-progress examples.  Be careful of sharp edges!

* A few more example pieces before natten will work out of the box.  Most of the ops have been validated, all that remains is to  wrap the na2d function call to ensure it will dispatch properly.

* Fix naming scheme

* Minor name change

* Add monkey patching for na2d operation with shard tensors

* Fix bug in shard tensor inference of globla size.  CHeck agains sharding in unbind op rules.

* Enable backwards gradients for halo sharding and natten patch

* Convolution 2d backwards works, though would be  better to catch torch.ops.aten.convolution.default.

* Fix missing import and ensure tensors are contiguous before allgather_v

* Clean up and remove unnecessary noise and printouts for debugging

* Unify (and correct!) the sharded convolution implementation.  There was also a minor bug in the backward
pass that got more pronounced with smaller data: grad inputs were failing to properly collect
haloed gradients and add them on the edges.  Now fixed.

* Remove noise from sharding utils.

* For smaller tensors, the alltoall step of halo reductions might be significant overhead.
I'm implementing here an option to switch to peer to peer message passing, since it might
benefit from stream utilization in layers like natten.na2d.

It's a developer choice currently, not a user choice.

* Remove shard_utils file, it is a subfolder.

* Add modulus ShardTensor api documentation

* Clean up doc strings, type annotations and mesh implementation.  No significant functionality changes in this commit.

* Add significant docstring / type annotation cleanup to ShardTensor.

Add `scatter_tensor` function to enable more easy transition to shard tensor.
This function allows users to maintain data pipelines (on one rank) and easily
scatter that data to a domain mesh.

* Remove neighborhood attention prototypes

* Remove the rest of these examples since they are outdated and unnecessary

* Mostly, this commit is adding type annotations and doc strings.

But also, this adjusts the shard tensor mechanism for tracking shard info to use
a dict instead of a list of tuples.

* Clean up and document conv patches.
No real code changes applied here.

* clean up and improve documentation and type hints for shard utils worker functions

* Adding basic tests for shard tensor initialization and redistribution.

There appears to be one corner case in redistribute to fix.  TBD.

Tests for grad propogation are coming.

* Add full working example of multilevel parallelism with pytorch
FSDP and modulus ShardTensor

* Add missing type annotations

* Ensure scatter_tensor is available to import from modulus.distributed

* Update changelog and ensure wrapt is a optional dependency

* Update fsdp_and_shard_tensor.rst

Update tutorial based on feedback from @pzharrington

* Update __init__.py

Remove wildcard import.

* Update shard_tensor.py

fix spacing

* This is an essential bug fix for a missing import

* Update branch to pass CI tests.

* This commit provides several pieces:

- First, the ability to transpose the sharding dimensions is supported.  For square submeshs, 2x2 for example,
the output sharding will match the input sharding if it's uneven.  This can only be supported if the number of
devices in the output mesh dimension is equal to the input dimension, hence the restriction on square submeshes.
Other scenarios will apply dtensor-like chunk syntax, but return a shard tensor tracking that split.  Comprehensive
tests on 1D and 2D meshes are included here.  No testing is done at this time on 3D sharding / meshes.

- Second, the issues with torch.mean are intercepted and fixed.  This uses a new dispatch intercept (below)
and applies a weight to the mean, and converts the Partial placement to a Partial(sum) with the weight applied.
This has a bug that appears to be present in DTensor too: reductions over non-sharded dimensions appear to falter.
To be fixed in a future release.

- Third, ShardTensor has a new class attribute to accomodate operator interceptions.  The only applied function
at this time are variants of aten.mean, however, it is expected to convert all monkey patching to this syntax.

* Update monkey patching to ensure patches get applied by modulus, and don't require
them to trigger elsewhere.  If ShardTensor is used, the patches get applied.

Also, minor updates to docs.

* Codify ShardTensor and FSDP in tutorials.

* Apparently, codify'ing in rst requires double ticks.

* This commit fixes gradient propagation for unevenly sharded tensors.  Tests are coming in the next commit immediately after.

* Add tests for shard tensor: initialization, resharding, and gradient sharding.

Further, fixed an annoying bug in other distributed tests where OS environs weren't cleared after testing, and tsome tests would fail but only if others ran first.

Now, all distributed tests use a context manager to change OS environment variables locally only.

* Two things done here:
- Enable dynamic (off by default) wrapping of layers by shard tensor.  they get turned on automatically when a shard tensor is created.
- Rename the utils to manage env variables.

Tests are failing with unusual CPU errors on ORD.  Moving to github runners ...

* Disable patched operations by default.

* name change

* name change docs

* These two files should not be included in the release.  They are generated...

* RC fixes 1

* L-MGN: improve inference

* Remove obsolete config

* Docs fixes

* Readme updates

* Add notice about the rename

* Profiler Fixes.  Duplicate of #172

* backward compatibility fix with old modulus namespace

* Add custom installation of pyspng for arm

* post release updates to version, add migration guide to readme and update changelog

* Post rename updates (#816)

* post merge name changes

* some more updates

* updates

* Initial ReGen model release (#810)

* initial regen release

* add readme

* cleanup figures, use existing crps routine

* update changelog

* Bug entry point (#818)

* fixed grid effect

* added entrypoint fix

* whit space

* V2 name change

* fixed regisry

* fixed regisry

* CI

* removed back check

* fixed brocken dock string

* blaa fix

---------

Co-authored-by: Oliver Hennigh <[email protected]>

* Address pytorch versioning issues. (#820)

* This commit address version compatibility issues with pytorch.

Many new features of physicsnemo's distributed utilities, targeting domain parallelism,
require pytorch's DTensor package which was introduced in pytorch 2.6.0.  But, we don't
want to limit physicsnemo usage unnecessarily.

This commit introduces version checking utilities, which are then aplied to ShardTensor.
If torch is below 2.6.0, the distributed utilities will not import ShardTensor but
will still work.  If a user attempts to import ShardTensor directly, avoiding the
__init__.py  file, the version checking utilities will raise an exception.

Tests on shard tensor are likewise skipped if torch 2.6.0 is not installed.

Finally, an additional test file is included to validate the version checking tools.

* This commit further protects against older versions of pytorch
- change shard tensor minimum version to 2.5.9 to accomodate alpha release of 2.6.0a
- set minimum pytorch version for DeviceMesh to 2.4.0
- introduce function decorator that raises an exception when unavailable functions are called.
- adds a little more protection in the tests to differntiate,

---------

* 1.0.1 rc rebase (#829)

* Comment warnings setting (#830)

* Update pyproject.toml links (#832)

Replace `modulus` links with updated `physicsnemo` links.

Co-authored-by: Nicholas Geneva <[email protected]>

* Update README.md reference link (#821)

* Update README.md

---------

Co-authored-by: Nicholas Geneva <[email protected]>

* Update README.md (#833)

* Update README.md

* Dockerfile Fixes (#835)

* Update dockerfile

* Update dockerfile

* Order swap

* update

* Swap again

* add FORCE_CUDA flags to torch-scatter and torch-cluster source installs, install makani and fignet dependencies explicitly

---------

Co-authored-by: Kaustubh Tangsali <[email protected]>

* MSC Checkpointing Changes (#789)

* Working changes to be cleaned up.

* Rename msc_config.yaml

* Fixed pytorch test issue by removing MSC Cache

* Updated project dependencies

* Find MSC config using absolute path.

* Re-added cuda test parameter.

* Add test to read from public S3 bucket using MSC.

* Revert save_checkpoint_freq value.

* Remove temporary printing

* Remove unnecessary dependency

* Switched to use consistent mechanism for detecting msc URIs

* Moved fsspec.filesystem logic into filesystem.py

* Change to cache for non-file protocols when reading non-modulus models.

* Moved code to generate checkpoint directory.directory

* Added get_checkpoint_dir import

* Address review feedback.

* Changes from code review.

* Addressed file test issue from review.

* Fix to file existence check.

* Fix merge conflicts due to project name change.

* Updated CHANGELOG.

* Added Multi-Storage Client to allow checkpointing to/from Object Storage

Signed-off-by: Chris Hawes <[email protected]>

* Addressed issues identified by pre-commit.

* Update filesystem.py

* Update __init__.py

* Update Dockerfile

---------

Signed-off-by: Chris Hawes <[email protected]>
Co-authored-by: Nicholas Geneva <[email protected]>

* Fixes DeprecationWarning introduced in setuptools>=77 (#837)

* Fixes DeprecationWarning introduced in setuptools>=77

* setuptools does not allow redundant license specification in project.license and project.classifiers

* Cordiff usability and performance enhancements for custom dataset training (#790)

* Add recent checkpoints option, adjust configs

* Doc for deterministic_sampler

* Typo fix

* Bugfix and cleanup of corrdiff regression loss and UNet

* Minor fix in docstrings

* Bugfix + doc for corrdiff regression CE loss

* Refactor corrdiff configs for custom dataset

* Bugfix in configs

* Added info in corrdiff docs for custom training

* Minor change in corrdiff config

* bring back base config file removed by mistake

* Added config for generation on custom dataset

* Forgot some config files

* Fixed overlap pixel in custom config based on discussion in PR #703

* Corrdiff fixes to enable non-squared images and/or non-square patches. Needs testing.

* Fix small bug in config

* Removed arguments redundancy in patching utilities + fixed hight-width order

* Cleanup

* Added tests for rectangle images and patches

* Added wandb logging for corrdiff training

* Implements patching API. Refactors corrdiff train abnd generate to use it

* Corrdiff function to register new custom dataset

* Reorganize configs again

* Correction in configs: training duration is NOT in kilo images

* Readme re-write

* Updated CHANGELOG

* Fixed formatting

* Test fixes

* Typo fix

* Fixes on patching API

* Fixed patching bug and tests

* Simplifications in corrdiff diffusion step

* Forgot to propagate change to test for cordiff diffusion step

* Renamed patching API to explicit 2D

* Fixed shape in test

* Replace loops with fold/unfold patching for perf

* Added method to dynamically change number of patches in RandomPatching

* Adds safety checks for patch shapes in patching function. Fixes tests

* Fixes docs

* Forgot a fix in docs

* New embedding selection strategy in CorrDiff UNet models

* Updated CHANGELOG.md

* Fixed tests for SongUNet position emneddings

* More robust tests for patching

* Fixed docs bug

* More bugfixes in doc tests

* Some renaming

Signed-off-by: Charlelie Laurent <[email protected]>

* Bugfixes, cleanup, docstrings

Signed-off-by: Charlelie Laurent <[email protected]>

* Docstring improvement for UNet and EDMPrecondSR

Signed-off-by: Charlelie Laurent <[email protected]>

* Docs for InfiniteSampler

Signed-off-by: Charlelie Laurent <[email protected]>

* Corrected Readme info about training/generate from checkpoints

Signed-off-by: Charlelie Laurent <[email protected]>

* Bugfixes in generate scripts, cleanup debugging flags

Signed-off-by: Charlelie Laurent <[email protected]>

* Removed blank line from changelog

Signed-off-by: Charlelie Laurent <[email protected]>

* Fixes in CI tests

Signed-off-by: Charlelie Laurent <[email protected]>

* Forgot to commit one of the CI fixes

Signed-off-by: Charlelie Laurent <[email protected]>

* Fix example in doc

Signed-off-by: Charlelie Laurent <[email protected]>

---------

Signed-off-by: Charlelie Laurent <[email protected]>
Co-authored-by: Peter Harrington <[email protected]>

* Update from_checkpoint docs (#843)

* resolving merge conflicts

* fixing minor issues

* fixing conflicts

* fixing bugs

* Update README.md

* Domino perf (#848)

* Disable the `length` variables in BallQuery.  They are unused, but still allocate memory
and are saved for the backwards pass.  It's not necessary since it's never used, as
far as I can tell.

* Optimizations and efficiency improvements in the domino datapipe.  Highlights are:
- Alter numpy file format slightly to no longer require pickle.  TODO: needs a fallback if this fails.
  Removing the requirement on pickle allows slightly faster data loading with threading.
- Separate surface, volume, and joint preprocessing into stand alone functions.  This isn't super
  useful immediately but the end goal, if not subsampling, is to put the volume and surface pipelines
  in separate cuda streams to overlap them.
- Reverse the order of the kNN calculation in the surface preprocessing.  The kNN originally
  was finding all k neighbors of all points in the surface.  Now, if sampling, we find only
  neighbors of points that survive the sampling.  This is a 50x reduction in computational cost.
  (the 2% un-reduced cost comes from the unchanged need to build a search tree over the whole mesh)
- Rework and optimize several sampling functions: instead of creating an index for all points,
  randomizing it, and taking the front; now the functions will simply choose N_points at random.
  (This does not really help in the weighted sampling functions)
- Introduce a custom collation function for torch to bring cupy arrays to torch arrays without copy.
- All other small operations have been ported to cupy, which gives further benefits.

There are still to-dos:
- validate this works without cupy
- make sure this works without sampling (even if slow)
- Fix the need to jump to CPU for sdf and area_weighted_sampling.
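
As a rough sketch of the zero-copy collation idea mentioned in the notes above (hypothetical helper name; the actual DoMINO datapipe code differs), CuPy arrays can be handed to torch through DLPack so the data never leaves the GPU:

```python
import torch

try:
    import cupy as cp  # optional; the sketch falls back to torch.as_tensor without it
except ImportError:
    cp = None


def collate_gpu_sample(sample: dict) -> dict:
    """Hypothetical collate helper: wrap array-like values as torch tensors without a host copy."""
    out = {}
    for key, value in sample.items():
        if cp is not None and isinstance(value, cp.ndarray):
            out[key] = torch.from_dlpack(value)  # zero-copy, data stays on the GPU
        else:
            out[key] = torch.as_tensor(value)    # numpy arrays / tensors pass through cheaply
    return out
```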

* Remove obsolete and unused dataclasses - it's a flat config heirarchy, these are vestigal.

* This commit enables reading the old-style pickled files by default.  We can switch to
threaded reading when the preprocessing is ready and available.

* Provide more robust reading of pickled files.

Ensure compute_scale_factors works even with GPU preprocessing.

* Fix several small bugs: the dataloader sometimes implicitly uses cupy instead of
selecting based on config.

* Fix issue if using CPU data loading.

* Ensure all gpu preprocessing is directed to the proper device

* Ensure that the dataloader doesn't waste GPU memory.  Previously, loading
in a context on device != 0 would allocate memory on device 0.

* Enable zarr readers.  Use file path to toggle which type of file to read.

* Improve logging and track memory leak.  Enable zarr.

* Add GPU monitoring to the training script, and recreate the knn class each iteration.  Otherwise, it leads to a memory leak.

* Enforce the determinism request in the domino pipeline.

* This commit makes an improvement to the zarr reading: reads are now _chunk_aligned_
and read directly into a numpy buffer.  This enables better multithreading since each
thread only interfaces with one zarr chunk.

* Put ALL zarr chunk reads into futures and thread the IO.
Limited by the threadpool and IO speed.  It'd be nice to
stream right into pinned memory but it seems to be too
large data reads for that pool.  TBD.

* Introduce a Sharded data pipeline for DoMINO.  This class is constructed from the standard
pipeline, with several extra pieces of information:
- the domain mesh over which the data pipeline is sharded
- Whether to shard point-like outputs (volume fields, surface fields, etc)
- Whether to shard grid-like outputs

This commit also includes some minor refinements to the standard pipeline
to make bootstraping a sharded version functional.

* bug fix - validation step commented out

* minor fixes to train.py

* Fix CUPY float/int datatype casting. (#852)

* This commit addresses an issue where the mesh indexes were being improperly
converted to float32 at some point.  This enables the preprocessing workflow
to stay on the GPU for this section, if the data is on GPU.

* Update domino_datapipe.py

Fix bug in min/max joining.

* Update model.py (#855)

Replace torch.expand + torch.gather with torch.index_select.  This saves a huge amount of memory and is computationally even a little faster.

On 80GB, number of points can be increased from about 6000 to 60000.
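
A small illustration of the equivalence behind that swap (shapes are made up; the real model gathers neighbor features):

```python
import torch

points = torch.randn(100_000, 32)            # (N, C) per-point features
idx = torch.randint(0, 100_000, (4_096,))    # (K,) sampled point indices

# expand + gather: materializes a (K, C) index tensor just to pick rows
gathered = torch.gather(points, 0, idx.unsqueeze(-1).expand(-1, points.shape[1]))

# index_select: same result, no expanded index tensor and less intermediate memory
selected = torch.index_select(points, 0, idx)

assert torch.equal(gathered, selected)
```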

* modifying train.py

* minor fixes

* Domino Loss Functions (#853)

* This commit creates alternative versions of the domino loss functions that are
significantly simpler and shorter, while producing numerically consistent
results.

Original functions are maintained in this commit and the training script
compares individual loss components as well as total loss.

* Remove older loss functions and consolidate script.

* fourier features to model params and cleanup

* modifying train.py

* minor fixes

* merging changes in train.py

* Merges `main` branch back into `domino` branch (#856)

* Stormcast Customization (#799)

* Add common dataloader interface

* Training script runs with refactored dataloader

* More trainer refactoring

* Refactor inference

* Add support for gradient accumulation

* Add support for AMP 16-bit training

* Align training parameters with StormCast paper

* Add comments to inference.py

* Add lite configs

* Add lite configs

* Small bug fixes

* Add support for compiling model

* Validation fixes

* Refactor checkpoint loading at startup

* Support wandb offline mode

* Fix regression_model_forward

* Update CHANGELOG.md

* Update README.md for DOMINO (#846)

* Update ERA5 download example (#845)

* Update era5 download example

* Update changelog

* Pytest speedup (#849)

* add code to measure time spent in pytest

* speed up datapipe tests

* fix cleanup of dist vars (was causing slowdown in test_capture.py)

* speed up model tests

* bring back some parameterizations, reduced cpu tests

---------

Co-authored-by: Jussi Leinonen <[email protected]>
Co-authored-by: Kaustubh Tangsali <[email protected]>
Co-authored-by: Peter Harrington <[email protected]>

* DoMINO Model Refactor (#840)

* Remove unused imports, add typing on functions and DoMINO constructor

* Adds type hints

* Removes both unused imports and unused positional encoding function (never called).

* Removes calculate_gradient() function for sdf, which can be replaced with the one-liner torch.gradient(). Note that this is not exactly numerically identical, as:

 a) the old method did not divide the central difference by 2 to account for the doubled dx on a central difference, while this does.

 b) the old method replaced the ends with zero gradients, while the new method replaces them with one-sided first-order finite differences.

* Adds a docstring to scale_sdf()

* Replaces super(DoMINO, self) with super(), which is better practice in Python 3. (The former breaks upon inheritance, while the latter does not)

* Ruff formatting pass.

* Removes binarize_sdf(), which can be performed as a one-liner to enhance readability.

* type hints

* Adds documentation and readability refactors on ball_query warp kernel.

* Adds differentiability note

* Makes wp.tid() naming consistent across warp kernels, for readability.

* Adds type hinting on backwards pass.

* Adds docs and type hints on BallQueryLayer

* Conciseness

* Adds docs for BQWarp

* Adds forward pass docs for GeoProcessor

* Adds docs for BQWarp

* Adds docs for GeoConvOut

* Adds docs for GeoProcessor

* Functional change: removes padded_value=-10 default, which seems like dead code.

* Refactors layers for readability, and fixes an important bug: in the 3rd level of the downsampling, conv2 was accidentally used twice, and conv3 was never used (in the batch_norm branch).

* Adds docs

* Ruff format pass

* Ruff check fixes

* Fixes black formatting

* Removes geometry_encoder(), which is never used (other calls already directly use self.geo_rep, so this is dead code)

* Fixes mutable default arguments

* Adds ValueError for a potential silent error

* Fixes typos

* Stormcast Customization (#799)

* Add common dataloader interface

* Training script runs with refactored dataloader

* More trainer refactoring

* Refactor inference

* Add support for gradient accumulation

* Add support for AMP 16-bit training

* Align training parameters with StormCast paper

* Add comments to inference.py

* Add lite configs

* Add lite configs

* Small bug fixes

* Add support for compiling model

* Validation fixes

* Refactor checkpoint loading at startup

* Support wandb offline mode

* Fix regression_model_forward

* Update CHANGELOG.md

* Update README.md for DOMINO (#846)

* Update ERA5 download example (#845)

* Update era5 download example

* Update changelog

* Pytest speedup (#849)

* add code to measure time spent in pytest

* speed up datapipe tests

* fix cleanup of dist vars (was causing slowdown in test_capture.py)

* speed up model tests

* bring back some parameterizations, reduced cpu tests

* black formatting pass

* Format imports

* black formatting pass

* Fixes #840 (comment)

* markdownlint fix

* Remove unused input_features parameter from BQWarp instantiation in GeometryRep and DoMINO classes.

* Remove batch normalization layers and non-configurable flag from GeoProcessor class in model.py. Related to discussion here: #840 (comment)

* Fixes a bug where negative areas were causing NaNs in demo

* formatting

---------

Co-authored-by: Jussi Leinonen <[email protected]>
Co-authored-by: Kaustubh Tangsali <[email protected]>
Co-authored-by: Peter Harrington <[email protected]>

* updating model and fixing bug in datapipe

* Explicit Warp device allocation for SDF and Ball Query (#876)

* Explicit warp device management

* explicit warp device management in SDF

* Update sdf.py

* Update sdf.py

* Update sdf.py

* Update ball_query.py

* Update sdf.py

* Update CHANGELOG.md

* Update sdf.py

* A few fixes for the domino pipeline. (#863)

- initialize the distributed manager, if it isn't already.
- For partial datasets (surface only, volumen only) don't move "None"
  objects to cupy.
- When sampling/shuffling, if the number of points is too high then
  don't error.  Instead, shuffle and rely on padding.

* Domino merge from `main` (#888)

* Stormcast Customization (#799)

* Add common dataloader interface

* Training script runs with refactored dataloader

* More trainer refactoring

* Refactor inference

* Add support for gradient accumulation

* Add support for AMP 16-bit training

* Align training parameters with StormCast paper

* Add comments to inference.py

* Add lite configs

* Add lite configs

* Small bug fixes

* Add support for compiling model

* Validation fixes

* Refactor checkpoint loading at startup

* Support wandb offline mode

* Fix regression_model_forward

* Update CHANGELOG.md

* Update README.md for DOMINO (#846)

* Update ERA5 download example (#845)

* Update era5 download example

* Update changelog

* Pytest speedup (#849)

* add code to measure time spent in pytest

* speed up datapipe tests

* fix cleanup of dist vars (was causing slowdown in test_capture.py)

* speed up model tests

* bring back some parameterizations, reduced cpu tests

* Add CELU activation function (#851)

* refactor: updating naming of a few files (modulus -> physicsnemo) (#850)

Co-authored-by: Oliver Hennigh <[email protected]>

* Various Corrdiff optimizations for drastic increase of training efficiency (#809)

* mult-gpu training supported corrdiff optimization

* enable mixed precision for val

* clean codebase for opt

* add amp_mode aware model architecture

* add None checking for params

* revise datatype casting schema

* Add test cases for corrdiff optimizations

Signed-off-by: Neal Pan <[email protected]>

* revised from_checkpoint, update tests and CHANGELOG

Signed-off-by: jialusui1102 <[email protected]>

* Lint and format code properly

Signed-off-by: Neal Pan <[email protected]>

* add multi-gpu optimization

* rebase changes and update tests and configs

Signed-off-by: jialusui1102 <[email protected]>

* merge ResidualLoss and refactored layer and Unet init based on PR review

Signed-off-by: jialusui1102 <[email protected]>

* Update layers.py with robust apex import

* address incompatibility between dynamo and patching, retain same optimization perf w torch.compile

Signed-off-by: jialusui1102 <[email protected]>

* update tests

Signed-off-by: jialusui1102 <[email protected]>

* update changelog

Signed-off-by: jialusui1102 <[email protected]>

* initialize global_index directly on device

Signed-off-by: jialusui1102 <[email protected]>

* formatting

Signed-off-by: jialusui1102 <[email protected]>

---------

Signed-off-by: Neal Pan <[email protected]>
Signed-off-by: jialusui1102 <[email protected]>
Co-authored-by: Alicia Sui <[email protected]>
Co-authored-by: jialusui1102 <[email protected]>
Co-authored-by: Charlelie Laurent <[email protected]>

* Catch improper use of patch gradient accumulation (#868)

* Update train.py to catch improper use of patch grad acc

* Update train.py

* Update train.py

* Fixes compile of regression model in train.py

* Removed unused imports

Signed-off-by: Charlelie Laurent <[email protected]>

* Changed grad patch accumulation logic

Signed-off-by: Charlelie Laurent <[email protected]>

---------

Signed-off-by: Charlelie Laurent <[email protected]>

* This commit fixes two minor bugs in the physicsnemo profiling tools (#862)

- If line_profiler isn't available, it sometimes broke due to a missing check.
- If the torch profiler is used but the code exits before profiling, it will crash.

* Adding abokov-nv to authorized users to trigger blossom-ci.yml (#867)

* Fixes indexing issues and CPU memory consumption in dataloader (#879)

* Add workflow to automatically sync changes from nvidia/modulus into main branch

* add WeightedOceanMSE to criterion

* add optional gaussian noise to inputs and coupled variables during training - should improve coupled stability

* add random seed - still need to test

* remove datatransformer code - shouldn't be part of this PR

* move logging

* Removed blossom-ci workflow from modulus-uw fork, updated automatic sync

* Fix the training and inference problem in nvidia modulus

Signed-off-by: root <[email protected]>

* Fix indexing in constant coupler

* add 'Multi_SymmetricConvNeXtBlock'

* Replace 'n_layers' with 'n_conv_blocks' for clarity

* Fix indexing in constant coupler

Signed-off-by: root <[email protected]>

* Fix indexing in constant coupler

Signed-off-by: root <[email protected]>

* change back to 'n_layers' to match the old models

* enforce precedence of upstream modulus changes when auto syncing.

* set scaling for mean: 0, std: 1 where no change is needed

* fix memory leak in coupled timeseries

* Add workflow to automatically sync changes from nvidia/modulus into main branch

* Fix the training and inference problem in nvidia modulus

Signed-off-by: root <[email protected]>

* Fix indexing in constant coupler

Signed-off-by: root <[email protected]>

* Fix indexing in constant coupler

Signed-off-by: root <[email protected]>

* Removed blossom-ci workflow from modulus-uw fork, updated automatic sync

* enforce precedence of upstream modulus changes when auto syncing.

* set scaling for mean: 0, std: 1 where no change is needed

* add 'Multi_SymmetricConvNeXtBlock'

* Replace 'n_layers' with 'n_conv_blocks' for clarity

* change back to 'n_layers' to match the old models

* fix memory leak in coupled timeseries

* add coupler fixes, var and time selection

* Fix for ordering on coupler

* batch size fix in coupler

* broken workflow cleanup

* cleanup for upstream merge (#20)

---------

Signed-off-by: root <[email protected]>
Co-authored-by: nathanielcresswellclay <[email protected]>
Co-authored-by: zacespinosa <[email protected]>
Co-authored-by: nathanielcresswellclay <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Bowen Liu <[email protected]>
Co-authored-by: Bowen Liu <[email protected]>
Co-authored-by: Yair Cohen <[email protected]>

* CorrDiff: inference bugfixes, cleanup, and documentation improvements (#882)

* Disabled cuda profiler for cpu runs

Signed-off-by: Charlelie Laurent <[email protected]>

* Added __init__ to avoid dataset module collision

Signed-off-by: Charlelie Laurent <[email protected]>

* Disabled torch emit_nvtx for cpu runs. Renamed 'test_train_split' to 'validation'

Signed-off-by: Charlelie Laurent <[email protected]>

* Fixed typo in 'set_patch_num'

Signed-off-by: Charlelie Laurent <[email protected]>

* More profiling stats disabled for CPU runs

Signed-off-by: Charlelie Laurent <[email protected]>

* Removed duplicate code in ResidualLoss

Signed-off-by: Charlelie Laurent <[email protected]>

* Disabled AMP in inference

Signed-off-by: Charlelie Laurent <[email protected]>

* Fixed f-strings in train script

Signed-off-by: Charlelie Laurent <[email protected]>

* Added details about validation and early-stopping in readme

Signed-off-by: Charlelie Laurent <[email protected]>

---------

Signed-off-by: Charlelie Laurent <[email protected]>

* Stormcast customization conditions (#880)

* Add configurable model inputs

* Ensure pure diffusion works

* Update docstrings, error handling

* Update StormCast README.md and docstrings

* Minor revisions to StormCast README.md

* Fix typo in StormCast README.md

* Making the unit tests' nfs-data-dir configurable (#866)

* Update msc test with import_or_skip decorator (#884)

* update msc test with import_or_skip decorator

* linting

* update package name

---------

Co-authored-by: root <[email protected]>

---------

Signed-off-by: Neal Pan <[email protected]>
Signed-off-by: jialusui1102 <[email protected]>
Signed-off-by: Charlelie Laurent <[email protected]>
Signed-off-by: root <[email protected]>
Co-authored-by: Jussi Leinonen <[email protected]>
Co-authored-by: Kaustubh Tangsali <[email protected]>
Co-authored-by: Peter Harrington <[email protected]>
Co-authored-by: Yang-yang Tan <[email protected]>
Co-authored-by: Carmelo Gonzales <[email protected]>
Co-authored-by: Oliver Hennigh <[email protected]>
Co-authored-by: nekobytz <[email protected]>
Co-authored-by: Alicia Sui <[email protected]>
Co-authored-by: jialusui1102 <[email protected]>
Co-authored-by: Charlelie Laurent <[email protected]>
Co-authored-by: abokov-nv <[email protected]>
Co-authored-by: David Pruitt <[email protected]>
Co-authored-by: nathanielcresswellclay <[email protected]>
Co-authored-by: zacespinosa <[email protected]>
Co-authored-by: nathanielcresswellclay <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Bowen Liu <[email protected]>
Co-authored-by: Bowen Liu <[email protected]>
Co-authored-by: Yair Cohen <[email protected]>
Co-authored-by: root <[email protected]>

* Update tests to accommodate new domino model.  Minor tweaks to domino (#889)

docstring to make clear sizes > 0.

* fixing minor bugs

* Exclusively fix linting errors. (#895)

* Domino datapipe test (#896)

* Fix ruff error

* Add test for domino datapipe

* Fix ruff error.

* Remove numpy conversion since sdf now returns a numpy array directly

* Enable cupy usage in computing scaling factors.

* fixing PR comments

* fixing bug in cache data

* editing readme

* block failing code path.  Only fails on the CI system, which has a different driver.  Could be a driver issue.

* Update model.py

Read resolution from config instead of hard coding it.

* Update sdf.py

Remove .numpy call in doc string - sdf will return based on the input format.

* Update model.py

Fix typo

* Update sdf.py

Tweak output in docstring.

* Update model.py

Remove module printout ...

* Update Makefile

Temporarily reduce coverage requirements.

---------

Signed-off-by: root <[email protected]>
Signed-off-by: Chris Hawes <[email protected]>
Signed-off-by: Charlelie Laurent <[email protected]>
Signed-off-by: Neal Pan <[email protected]>
Signed-off-by: jialusui1102 <[email protected]>
Co-authored-by: Rishi Ranade <[email protected]>
Co-authored-by: nvssh nssswitch user account <[email protected]>
Co-authored-by: Michael Mara <[email protected]>
Co-authored-by: Derek Lai <[email protected]>
Co-authored-by: Kaustubh Tangsali <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Ikko Eltociear Ashimine <[email protected]>
Co-authored-by: Nicholas Geneva <[email protected]>
Co-authored-by: ram-cherukuri <[email protected]>
Co-authored-by: David Pruitt <[email protected]>
Co-authored-by: nathanielcresswellclay <[email protected]>
Co-authored-by: zacespinosa <[email protected]>
Co-authored-by: nathanielcresswellclay <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Bowen Liu <[email protected]>
Co-authored-by: Bowen Liu <[email protected]>
Co-authored-by: Yair Cohen <[email protected]>
Co-authored-by: Kaustubh Tangsali <[email protected]>
Co-authored-by: Alexey Kamenev <[email protected]>
Co-authored-by: Peter Sharpe <[email protected]>
Co-authored-by: Simon Byrne <[email protected]>
Co-authored-by: Alexey Kamenev <[email protected]>
Co-authored-by: Oliver Hennigh <[email protected]>
Co-authored-by: Peter Harrington <[email protected]>
Co-authored-by: Oliver Hennigh <[email protected]>
Co-authored-by: Oliver Hennigh <[email protected]>
Co-authored-by: WG <[email protected]>
Co-authored-by: chris-hawes <[email protected]>
Co-authored-by: Charlelie Laurent <[email protected]>
Co-authored-by: Peter Harrington <[email protected]>
Co-authored-by: RishikeshRanade <[email protected]>
Co-authored-by: Jussi Leinonen <[email protected]>
Co-authored-by: Mohammad Amin Nabian <[email protected]>
Co-authored-by: Yang-yang Tan <[email protected]>
Co-authored-by: Carmelo Gonzales <[email protected]>
Co-authored-by: nekobytz <[email protected]>
Co-authored-by: Alicia Sui <[email protected]>
Co-authored-by: jialusui1102 <[email protected]>
Co-authored-by: abokov-nv <[email protected]>
Co-authored-by: root <[email protected]>
coreyjadams added a commit that referenced this pull request May 21, 2025
* adding multiscale feature to model

* bug fixes in model

* removing non-dim scaling from domino datapipe

* adding tests for multiscale training and inference script

* minor fix for surface training blow up

* fixing bug in inference

* surface volume radii

* Profiling (#787)

* Stashing profiling work

* Torch profile works but is very slow.  line profiler not functional at this time

* Enablement of profiling tool with pytorch profiler, as a context manager.  Still several TBD Objects but this implementation will capture a torch profile.

* Moving profiling tools into a directory to make separate tools more clearly separated as well as enable easier extensions.

* Profiling tools work with torch profiler and line_profiler.  nsys has a crash that I haven't resolved yet.

* Fix line profiling construction

* Begin instrumenting figconvnet and adding tutorials on modulus profiling tools

* Remove annotations and force all annotations to conform to nvtx.  Simpler, for now, and the most (only?) useful annotation tool

* Updating profiling tutorial

* Minor updates to profiling interfaces

* only adding some profiling hooks to figconvnet

* Add profiling hooks to mesh graph net.

* Set TELayerNorm to default layer norm in MeshGraphNet

* Nearly finished profiling tutorial and tooling example.  Just need to add images.

* Final (first) draft of the profiling tutorial and clean up profiler code slightly.  Ready for draft PR

* Add tests to the profiler tools to check functionality.  Thanks Cursor!

Some minor updates to the tools themselves to accommodate instance clearing and refreshing.

* Update changelog for profiling tools

* Update profiler files to (hopefully) pass CI checks

* Remove profiling parts from capture.py for later integration

* Update __init__.py

Remove nvtx wrapper

* Add extra line to make linting happy...

* When cuda is not available (mostly CI), emit a warning and switch to native layer norm.
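
A minimal sketch of that fallback, assuming TransformerEngine's `LayerNorm` import path (the factory name and warning text are illustrative, not the MeshGraphNet code):

```python
import warnings
import torch
import torch.nn as nn

def make_layer_norm(hidden_dim: int) -> nn.Module:
    """Prefer TELayerNorm when CUDA is usable, otherwise warn and fall back."""
    if torch.cuda.is_available():
        try:
            from transformer_engine.pytorch import LayerNorm as TELayerNorm
            return TELayerNorm(hidden_dim)
        except ImportError:
            warnings.warn("transformer_engine unavailable; using torch.nn.LayerNorm")
    else:
        warnings.warn("CUDA not available; falling back to torch.nn.LayerNorm")
    return nn.LayerNorm(hidden_dim)
```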

* Make the default LayerNorm so tests will pass.  Needs more care in the test, I think, about TELayerNorm

* Very minor fixes per review

* Resolve most comments from PR review.  One to go (profiler state to become a literal)

* Change profiler state tracker to a single state with an enum type.

* Two changes made here:
- the exit stack moves from a class variable to an instance variable
- The double-check locking mechanism in the registry becomes a single lock and check.

* Make sure the exit stack init is actually in __init__ and not initialize()

* Enable Domain Parallelism with ShardTensor (#784)

* Enable mesh-based parallelism as the configuration backend, even for simple DDP sharding

* Fix small typo in docstring

* Remove  unnecessary  functions with new interface

* Adding first implementation of ShardTensor prototype.  Still several pieces are WIP but this has basic functionality supported for creation and forward usage.

* Working implementation of ShardTensor, though still somewhat incomplete.

* Adding work-in-progress examples.  Be careful of sharp edges!

* A few more example pieces before natten will work out of the box.  Most of the ops have been validated, all that remains is to  wrap the na2d function call to ensure it will dispatch properly.

* Fix naming scheme

* Minor name change

* Add monkey patching for na2d operation with shard tensors

* Fix bug in shard tensor inference of global size.  Check against sharding in unbind op rules.

* Enable backwards gradients for halo sharding and natten patch

* Convolution 2d backwards works, though it would be better to catch torch.ops.aten.convolution.default.

* Fix missing import and ensure tensors are contiguous before allgather_v

* Clean up and remove unnecessary noise and printouts for debugging

* Unify (and correct!) the sharded convolution implementation.  There was also a minor bug in the backward
pass that got more pronounced with smaller data: grad inputs were failing to properly collect
haloed gradients and add them on the edges.  Now fixed.

* Remove noise from sharding utils.

* For smaller tensors, the alltoall step of halo reductions might be significant overhead.
I'm implementing here an option to switch to peer to peer message passing, since it might
benefit from stream utilization in layers like natten.na2d.

It's a developer choice currently, not a user choice.
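
For intuition, a peer-to-peer halo exchange along one dimension can be sketched as below; this is an illustration assuming at least two ranks and periodic neighbors, not the library's halo implementation:

```python
import torch
import torch.distributed as dist

def p2p_halo_exchange(local: torch.Tensor, halo: int):
    """Trade a halo slab with the left/right neighbor using isend/irecv."""
    rank, world = dist.get_rank(), dist.get_world_size()
    left, right = (rank - 1) % world, (rank + 1) % world
    send_left = local[:halo].contiguous()
    send_right = local[-halo:].contiguous()
    recv_left = torch.empty_like(send_left)
    recv_right = torch.empty_like(send_right)
    ops = [
        dist.P2POp(dist.isend, send_left, left),
        dist.P2POp(dist.isend, send_right, right),
        dist.P2POp(dist.irecv, recv_left, left),
        dist.P2POp(dist.irecv, recv_right, right),
    ]
    for req in dist.batch_isend_irecv(ops):
        req.wait()
    return recv_left, recv_right  # halos arriving from the left and right neighbors
```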

* Remove shard_utils file, it is a subfolder.

* Add modulus ShardTensor api documentation

* Clean up doc strings, type annotations and mesh implementation.  No significant functionality changes in this commit.

* Add significant docstring / type annotation cleanup to ShardTensor.

Add `scatter_tensor` function to enable more easy transition to shard tensor.
This function allows users to maintain data pipelines (on one rank) and easily
scatter that data to a domain mesh.

* Remove neighborhood attention prototypes

* Remove the rest of these examples since they are outdated and unnecessary

* Mostly, this commit is adding type annotations and doc strings.

But also, this adjusts the shard tensor mechanism for tracking shard info to use
a dict instead of a list of tuples.

* Clean up and document conv patches.
No real code changes applied here.

* clean up and improve documentation and type hints for shard utils worker functions

* Adding basic tests for shard tensor initialization and redistribution.

There appears to be one corner case in redistribute to fix.  TBD.

Tests for grad propagation are coming.

* Add full working example of multilevel parallelism with pytorch
FSDP and modulus ShardTensor

* Add missing type annotations

* Ensure scatter_tensor is available to import from modulus.distributed

* Update changelog and ensure wrapt is an optional dependency

* Update fsdp_and_shard_tensor.rst

Update tutorial based on feedback from @pzharrington

* Update __init__.py

Remove wildcard import.

* Update shard_tensor.py

fix spacing

* This is an essential bug fix for a missing import

* Update branch to pass CI tests.

* This commit provides several pieces:

- First, the ability to transpose the sharding dimensions is supported.  For square submeshes, 2x2 for example,
the output sharding will match the input sharding if it's uneven.  This can only be supported if the number of
devices in the output mesh dimension is equal to the input dimension, hence the restriction on square submeshes.
Other scenarios will apply dtensor-like chunk syntax, but return a shard tensor tracking that split.  Comprehensive
tests on 1D and 2D meshes are included here.  No testing is done at this time on 3D sharding / meshes.

- Second, the issues with torch.mean are intercepted and fixed.  This uses a new dispatch intercept (below)
and applies a weight to the mean, and converts the Partial placement to a Partial(sum) with the weight applied.
This has a bug that appears to be present in DTensor too: reductions over non-sharded dimensions appear to falter.
To be fixed in a future release.  (A weighted-mean sketch follows this list.)

- Third, ShardTensor has a new class attribute to accommodate operator interceptions.  The only applied functions
at this time are variants of aten.mean; however, it is expected to convert all monkey patching to this syntax.
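
The weighted-mean idea from the second point can be written out in plain torch (illustration only, not the ShardTensor dispatch code): a naive mean of per-shard means is wrong when shard sizes differ, while weighting each partial sum by its element count recovers the global mean.

```python
import torch

def global_mean_from_shards(shards):
    counts = sum(s.numel() for s in shards)
    partial_sums = torch.stack([s.sum() for s in shards])
    return partial_sums.sum() / counts

a, b = torch.ones(6), 3.0 * torch.ones(2)   # uneven shards of one logical tensor
assert torch.isclose(global_mean_from_shards([a, b]), torch.cat([a, b]).mean())
assert not torch.isclose((a.mean() + b.mean()) / 2, torch.cat([a, b]).mean())
```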

* Update monkey patching to ensure patches get applied by modulus, and don't require
them to trigger elsewhere.  If ShardTensor is used, the patches get applied.

Also, minor updates to docs.

* Codify ShardTensor and FSDP in tutorials.

* Apparently, codify'ing in rst requires double ticks.

* This commit fixes gradient propagation for unevenly sharded tensors.  Tests are coming in the next commit immediately after.

* Add tests for shard tensor: initialization, resharding, and gradient sharding.

Further, fixed an annoying bug in other distributed tests where OS environs weren't cleared after testing, and some tests would fail but only if others ran first.

Now, all distributed tests use a context manager to change OS environment variables locally only.

* Two things done here:
- Enable dynamic (off by default) wrapping of layers by shard tensor.  they get turned on automatically when a shard tensor is created.
- Rename the utils to manage env variables.

Tests are failing with unusual CPU errors on ORD.  Moving to github runners ...

* Disable patched operations by default.

* name change

* name change docs

* refactoring

* This commit addresses two issues:
- First, there is a bug upstream in pytorch.  The profile will currently fall over with stack=True, so it's off by default here.
- Second, refactoring the state led to some subtle logic errors, where previously it was
  possible to have enabled and initialized both be true.  That broke.  This commit fixes
  by assuming a numerical state progression, so it's now a __ge__ comparison, and if the
  state is ENABLED for example, @property(initialized()) evaluates to true.
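
For illustration, the numerical state progression could look roughly like this (class and member names are assumptions, not the actual profiler code):

```python
from enum import IntEnum

class ProfilerState(IntEnum):
    DISABLED = 0
    INITIALIZED = 1
    ENABLED = 2

class ProfilerRegistry:
    def __init__(self) -> None:
        self._state = ProfilerState.DISABLED

    @property
    def initialized(self) -> bool:
        # ENABLED implies INITIALIZED through the numeric ordering
        return self._state >= ProfilerState.INITIALIZED

    @property
    def enabled(self) -> bool:
        return self._state >= ProfilerState.ENABLED

    def enable(self) -> None:
        self._state = ProfilerState.ENABLED
```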

* hyper tuning

* fixing duplication in model

* Minor fixes and updates to the profiling utility.

* Add functionality to distributed manager to provide mesh-wide groups.

A pytorch DeviceMesh provides syntax to access the ProcessGroup across a mesh dimension.
Sometimes, it is most easy to access a group that includes all devices of a mesh.  This isn't
included in the upstream syntax, so it's built out here.  Because the cost of creating a group
is not small, the groups are cached using the devicemesh itself (hashed) as a key.

In 1D meshes, the underlying group for that mesh is returned rather than creating a new group.
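
A minimal sketch of that caching pattern follows; the helper name is mine and the real DistributedManager method may differ, but the key ideas (1D fast path, hashable mesh as the cache key) are as described above:

```python
import torch.distributed as dist
from torch.distributed.device_mesh import DeviceMesh

_mesh_group_cache: dict = {}

def mesh_wide_group(mesh: DeviceMesh) -> dist.ProcessGroup:
    """Return one process group spanning every device of `mesh`."""
    if mesh.ndim == 1:
        # A 1D mesh already carries a group over all of its devices.
        return mesh.get_group(0)
    if mesh not in _mesh_group_cache:
        # new_group is a collective and is expensive, so build it once per mesh
        # and cache it, keyed on the (hashable) mesh itself.
        ranks = mesh.mesh.flatten().tolist()
        _mesh_group_cache[mesh] = dist.new_group(ranks=ranks)
    return _mesh_group_cache[mesh]
```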

* Performance enhancements to shard tensor.  Not fully optimized yet but better.

The main performance issue with ShardTensor appears to be blocking DtoH and HtoD
copies for inferring the sharding shapes and sizes.

Construction of a shard tensor now takes an optional argument `sharding_shapes`
to dictate how sharding shapes are determined.

"infer" will use group-wide communication to allgather the shapes on each sharded
mesh dimension.

"chunk" will assume DTensor-like chunking.  A sanity check will be performed,
that the local shape matches the computed shape.  In the event that the local
shape matches but only on one rank, this could lead to a hang - it is because
the input DTensor would have been incorrectly formatted.  No communication
is done with "chunk" method unless the sanity check fails.

Sharding shapes can be passed directly, too.  Global shape will be inferred
in this case.

Additionally, `scatter_tensor` has been made a little more versatile
at the cost of slightly worse performance.  There are possible optimizations
but unlikely to provide serious performance benefits yet.
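
To make the two strategies concrete, here is a communication-level sketch (function names are mine; the real constructor argument handling is more involved):

```python
import torch
import torch.distributed as dist

def shard_shapes_infer(local: torch.Tensor, group=None):
    """'infer': one collective so every rank learns every rank's local shape."""
    shapes = [None] * dist.get_world_size(group)
    dist.all_gather_object(shapes, tuple(local.shape), group=group)
    return shapes

def shard_shapes_chunk(global_size: int, world: int):
    """'chunk': communication-free, torch.chunk-style split sizes along one dim."""
    chunk = -(-global_size // world)  # ceiling division, matching torch.chunk
    return [min(chunk, max(0, global_size - r * chunk)) for r in range(world)]
```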

* model improvement

* Hot fix - interface for mesh names was incorrect.

* Small updates to ShardTensor and redistribution methods.

* This commit improves the functionality, readability, and maintainability of the halo communication.

It is incorporated into both conv* as well as natten.  Other operations that require a halo
on image-like data should be supportable with this update more easily.

* Add support for a select group of conv_transpose shapes where kernel == padding

* Enable group normalization with shard tensor.

* Add attention mechanism (scaled_dot_product_attention) to supported ShardTensor ops.

* Add average pooling functionality for select shapes.

* Enable pooling, normalization, and attention patches when registering shard wrappers

* Remove printouts ...

* This commit addresses issues that arose in the merge of my feature branch,
after it diverged from the renamed release branch.

It also fixes a small typo in a warning message in BallQuery.

* Add a sharding propagation for aten.select.int.

Now, a = b[:,:,3] will work on sharded tensors as long as you aren't
selecting on the sharded dimension.

* Reorganize the halo and ring message passing to be easier to follow and maintain.

This reorganization isolates the layers of halo passing by conceptual
operation (building the halos, communicate halos, apply halos, slice
off residuals, etc).  The code is a little longer but the upshot
is the availability of high level functions `halo_padding` and `unhalo_padding`
which are both differentiable and easily applied.

Also introduces a ring message passing function.  Note that this function
is synchronous (so is the halo) and while the halo needs to be synchronous,
ring message passing often does not.  It's included nevertheless as a
simple, easy to use version to enable debugging of the overlapped version.

Both Halos and Rings and now configured with light dataclass objects
to make the number of arguments passed around simpler, and easier to maintain
state between forward and backward passes.
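
The "light dataclass" configuration mentioned above might look roughly like this (field names are assumptions for illustration):

```python
from dataclasses import dataclass

@dataclass
class HaloConfig:
    mesh_dim: int          # mesh dimension the halo is exchanged along
    tensor_dim: int        # tensor dimension that carries the halo
    halo_size: int         # number of elements exchanged on each side
    method: str = "a2a"    # e.g. "a2a" or "p2p"

@dataclass
class RingConfig:
    mesh_dim: int
    chunk_dim: int
    synchronous: bool = True  # the simple ring described above stays synchronous
```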

* This commit adds support for Max Pooling, Unpooling via nearest neighbor upsampling,
attention layers via sequence parallelism, and a semi-parallel version of BallQuery
from physicsnemo.

With this commit, the DoMINO model can be used in a domain-parallel way with ShardTensor.
Some optimizations and one further level of parallelism remain.

* This commit adds tests for RingBallQuery (which is ball query on sharded tensors).

It also includes some edge case fixes for the ball query ring ops, which change minor
details in the attention patches.

* make sure that convolutions and ball query compute output shapes and don't perform a blocking op

* Add profiling hooks to convolution wrapper and halo padding.

Also removes an unnecessary `to` in halo padding.

* adding resampling and geo encoding type

* Disable the `length` variables in BallQuery.  They are unused, but still allocate memory
and are saved for the backwards pass.  It's not necessary since it's never used, as
far as I can tell.

* This commit applies some reorganizations to the ball query layer to enable the distributed interception
more easily.  In summary:
- All warp code and conversions are pushed into functional primitives that I can also
  leverage from the shard tensor wrapper.
- Contexts are maintained as pure torch tensors to alleviate any type casting issues,
  which I was hitting in the distributed version.
- Using primitive functions also lets me maintain only one context more easily.  If forced
  to go through the base layer, I have to duplicate activation saving in the distributed versions.

The gradients in the distributed version are not _quite_ finished yet.

* Caching Mechanism For DoMINO Training (#805)

* Profiling code

* Fix typo

* Factor out compute_scaling_factors to re-use in cache script

* Add caching capabilities

* Domino datapipe

* Missed removal

* Hotfixes

* Modify cached config

* Prune unused imports, add requirements and formatting from pre-commit hook

* Simplify domino dataset handling in presence of cache

* Renaming

* Add the Datacenter use case (#783)

* Uploaded core files for the datacenter case!

* add licensing text

* add readme

* update datapipe, make caching optional

* update Readme

* update changelog

* update requirements.txt

---------

Co-authored-by: Kaustubh Tangsali <[email protected]>
Co-authored-by: root <[email protected]>

* Update README.md (#769)

Learnning -> Learning

Co-authored-by: Nicholas Geneva <[email protected]>

* Update README.md (#780)

Added link to preprint paper for Domino and XAeronet

Co-authored-by: Nicholas Geneva <[email protected]>

* Merge dlwp-healpix updates from modulus-uw (#785)

* Add workflow to automatically sync changes from nvidia/modulus into main branch

* add WeightedOceanMSE to criterion

* add optional gaussian noise to inputs and coupled variables during training - should improve coupled stability

* add random seed - still need to test

* remove datatransformer code - shouldn't be part of this PR

* move logging

* Removed blossom-ci workflow from modulus-uw fork, updated automatic sync

* Fix the training and inference problem in nvidia modulus

Signed-off-by: root <[email protected]>

* Fix indexing in constant coupler

* add 'Multi_SymmetricConvNeXtBlock'

* Replace 'n_layers' with 'n_conv_blocks' for clarity

* Fix indexing in constant coupler

Signed-off-by: root <[email protected]>

* Fix indexing in constant coupler

Signed-off-by: root <[email protected]>

* change back to 'n_layers' to match the old models

* enforce precedence of upstream modulus changes when auto syncing.

* set scaling for mean: 0, std: 1 where no change is needed

* merge fixes and doc updates

* Fix on batch size check, logging cleanup

* test and lint fixes

* update coupler doc

* add test for missing fail condition on dlwp coupler

* fix error message for coupler

* add missing tests

* add missing tests

* adding tests for dlwp-healpix

* setup for upstream merge

* add test for multi symmetric conv block

* add missing tests, remove unreachable code

* update tests, remove unreachable code

* add additional tests for dlwp healpix couplers

* add tests and docs, cleanup code

* switch to import_or_fail decorator

* update to import_or_fail idiom

* add test for WeightedOceanMSE

---------

Signed-off-by: root <[email protected]>
Co-authored-by: nathanielcresswellclay <[email protected]>
Co-authored-by: zacespinosa <[email protected]>
Co-authored-by: nathanielcresswellclay <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Bowen Liu <[email protected]>
Co-authored-by: Bowen Liu <[email protected]>
Co-authored-by: Yair Cohen <[email protected]>

* Update Dockerfile (#791)

* Add random walk noise and kinematic mask (#786)

* Move to experiment-based Hydra config. Refactor logging.

* Update README and configs.

* Delete old configs.

* Revert "Delete old configs."

This reverts commit 5c13a1fafcf648e90f10573ae505cdc30ca91e27.

* Refactor LagrangianDataset to support different noise schedules.

* Add random walk noise support to LagrangianDataset.

* Address review comments.

* Merge branch 'main' into lagrangian-mgn

* Revert "Merge branch 'main' into lagrangian-mgn"

This reverts commit 3b4f41d432b02329ecb6de230bc3db2b5648237a.

* Update CHANGELOG

* Add kinematic particles mask.

* Remove unused code.

* Fix unit test, update configs, inference script cleanup.

* Address review feedback.

* Update docstrings.

* Adds new Modulus devs to /blossom-ci authorized users (#792)

* Adds Peter S to Blossom-CI auth user list

* Adds a few other recently-added team members ot the CI list

* Dockerfile changes to handle onnxruntime dependency (#793)

* Update Dockerfile

* update onnx installation, and update tests

* update test onnx utils

* Fix NCCL_ASYNC_ERROR_HANDLING deprecation warning (#711)

* Fix NCCL_ASYNC_ERROR_HANDLING deprecation warning

It looks like the patch from https://github.com/pytorch/pytorch/pull/114077 landed in torch 2.2.0.

Fixes #568.

* Update CHANGELOG.md

* Profiling (#787)

* Stashing profiling work

* Torch profile works but is very slow.  line profiler not functional at this time

* Enablement of profiling tool with pytorch profiler, as a context manager.  Still several TBD Objects but this implementation will capture a torch profile.

* Moving profiling tools into a directory to make separate tools more clearly separated as well as enable easier extensions.

* Profiling tools work with torch profiler and line_profiler.  nsys has a crash that I haven't resolved yet.

* Fix line profiling construction

* Begin instrumenting figconvnet and adding tutorials on modulus profiling tools

* Remove annotations and force all annotations to conform to nvtx.  Simpler, for now, and the most (only?) useful annotation tool

* Updating profiling tutorial

* Minor updates to profiling interfaces

* only adding some profiling hooks to figconvnet

* Add profiling hooks to mesh graph net.

* Set TELayerNorm to default layer norm in MeshGraphNet

* Nearly finished profiling tutorial and tooling example.  Just need to add images.

* Final (first) draft of the profiling tutorial and clean up profiler code slightly.  Ready for draft PR

* Add tests to the profiler tools to check functionality.  Thanks Cursor!

Some minor updates to the tools themselves to accommodate instance clearing and refreshing.

* Update changelog for profiling tools

* Update profiler files to (hopefully) pass CI checks

* Remove profiling parts from capture.py for later integration

* Update __init__.py

Remove nvtx wrapper

* Add extra line to make linting happy...

* When cuda is not available (mostly CI), emit a warning and switch to native layer norm.

* Make the default LayerNorm so tests will pass.  Needs more care in the test, I think, about TELayerNorm

* Very minor fixes per review

* Resolve most comments from PR review.  One to go (profiler state to become a literal)

* Change profiler state tracker to a single state with an enum type.

* Two changes made here:
- the exit stack moves from a class variable to an instance variable
- The double-check locking mechanism in the registry becomes a single lock and check.

* Make sure the exit stack init is actually in __init__ and not initialize()

* Enable Domain Parallelism with ShardTensor (#784)

* Enable mesh-based parallelism as the configuration backend, even for simple DDP sharding

* Fix small typo in docstring

* Remove  unnecessary  functions with new interface

* Adding first implementation of ShardTensor prototype.  Still several pieces are WIP but this has basic functionality supported for creation and forward usage.

* Working implementation of ShardTensor, though still somewhat incomplete.

* Adding work-in-progress examples.  Be careful of sharp edges!

* A few more example pieces before natten will work out of the box.  Most of the ops have been validated, all that remains is to  wrap the na2d function call to ensure it will dispatch properly.

* Fix naming scheme

* Minor name change

* Add monkey patching for na2d operation with shard tensors

* Fix bug in shard tensor inference of global size.  Check against sharding in unbind op rules.

* Enable backwards gradients for halo sharding and natten patch

* Convolution 2d backwards works, though it would be better to catch torch.ops.aten.convolution.default.

* Fix missing import and ensure tensors are contiguous before allgather_v

* Clean up and remove unnecessary noise and printouts for debugging

* Unify (and correct!) the sharded convolution implementation.  There was also a minor bug in the backward
pass that got more pronounced with smaller data: grad inputs were failing to properly collect
haloed gradients and add them on the edges.  Now fixed.

* Remove noise from sharding utils.

* For smaller tensors, the alltoall step of halo reductions might be significant overhead.
I'm implementing here an option to switch to peer to peer message passing, since it might
benefit from stream utilization in layers like natten.na2d.

It's a developer choice currently, not a user choice.

* Remove shard_utils file, it is a subfolder.

* Add modulus ShardTensor api documentation

* Clean up doc strings, type annotations and mesh implementation.  No significant functionality changes in this commit.

* Add significant docstring / type annotation cleanup to ShardTensor.

Add `scatter_tensor` function to enable more easy transition to shard tensor.
This function allows users to maintain data pipelines (on one rank) and easily
scatter that data to a domain mesh.

* Remove neighborhood attention prototypes

* Remove the rest of these examples since they are outdated and unnecessary

* Mostly, this commit is adding type annotations and doc strings.

But also, this adjusts the shard tensor mechanism for tracking shard info to use
a dict instead of a list of tuples.

* Clean up and document conv patches.
No real code changes applied here.

* clean up and improve documentation and type hints for shard utils worker functions

* Adding basic tests for shard tensor initialization and redistribution.

There appears to be one corner case in redistribute to fix.  TBD.

Tests for grad propagation are coming.

* Add full working example of multilevel parallelism with pytorch
FSDP and modulus ShardTensor

* Add missing type annotations

* Ensure scatter_tensor is available to import from modulus.distributed

* Update changelog and ensure wrapt is an optional dependency

* Update fsdp_and_shard_tensor.rst

Update tutorial based on feedback from @pzharrington

* Update __init__.py

Remove wildcard import.

* Update shard_tensor.py

fix spacing

* This is an essential bug fix for a missing import

* Update branch to pass CI tests.

* This commit provides several pieces:

- First, the ability to transpose the sharding dimensions is supported.  For square submeshes, 2x2 for example,
the output sharding will match the input sharding if it's uneven.  This can only be supported if the number of
devices in the output mesh dimension is equal to the input dimension, hence the restriction on square submeshes.
Other scenarios will apply dtensor-like chunk syntax, but return a shard tensor tracking that split.  Comprehensive
tests on 1D and 2D meshes are included here.  No testing is done at this time on 3D sharding / meshes.

- Second, the issues with torch.mean are intercepted and fixed.  This uses a new dispatch intercept (below)
and applies a weight to the mean, and converts the Partial placement to a Partial(sum) with the weight applied.
This has a bug that appears to be present in DTensor too: reductions over non-sharded dimensions appear to falter.
To be fixed in a future release.

- Third, ShardTensor has a new class attribute to accommodate operator interceptions.  The only applied functions
at this time are variants of aten.mean; however, it is expected to convert all monkey patching to this syntax.

* Update monkey patching to ensure patches get applied by modulus, and don't require
them to trigger elsewhere.  If ShardTensor is used, the patches get applied.

Also, minor updates to docs.

* Codify ShardTensor and FSDP in tutorials.

* Apparently, codify'ing in rst requires double ticks.

* This commit fixes gradient propagation for unevenly sharded tensors.  Tests are coming in the next commit immediately after.

* Add tests for shard tensor: initialization, resharding, and gradient sharding.

Further, fixed an annoying bug in other distributed tests where OS environs weren't cleared after testing, and some tests would fail but only if others ran first.

Now, all distributed tests use a context manager to change OS environment variables locally only.

* Two things done here:
- Enable dynamic (off by default) wrapping of layers by shard tensor.  they get turned on automatically when a shard tensor is created.
- Rename the utils to manage env variables.

Tests are failing with unusual CPU errors on ORD.  Moving to github runners ...

* Disable patched operations by default.

* name change

* name change docs

* These two files should not be included in the release.  They are generated...

* RC fixes 1

* L-MGN: improve inference

* Remove obsolete config

* Docs fixes

* Readme updates

* Add notice about the rename

* Profiler Fixes.  Duplicate of #172

* backward compatibility fix with old modulus namespace

* Add custom installation of pyspng for arm

* post release updates to version, add migration guide to readme and update changelog

* Post rename updates (#816)

* post merge name changes

* some more updates

* updates

* Initial ReGen model release (#810)

* initial regen release

* add readme

* cleanup figures, use existing crps routine

* update changelog

* Bug entry point (#818)

* fixed grid effect

* added entrypoint fix

* white space

* V2 name change

* fixed registry

* fixed registry

* CI

* removed back check

* fixed broken docstring

* blaa fix

---------

Co-authored-by: Oliver Hennigh <[email protected]>

* Address pytorch versioning issues. (#820)

* This commit address version compatibility issues with pytorch.

Many new features of physicsnemo's distributed utilities, targeting domain parallelism,
require pytorch's DTensor package which was introduced in pytorch 2.6.0.  But, we don't
want to limit physicsnemo usage unnecessarily.

This commit introduces version checking utilities, which are then applied to ShardTensor.
If torch is below 2.6.0, the distributed utilities will not import ShardTensor but
will still work.  If a user attempts to import ShardTensor directly, avoiding the
__init__.py  file, the version checking utilities will raise an exception.

Tests on shard tensor are likewise skipped if torch 2.6.0 is not installed.

Finally, an additional test file is included to validate the version checking tools.

* This commit further protects against older versions of pytorch
- change shard tensor minimum version to 2.5.9 to accommodate alpha release of 2.6.0a
- set minimum pytorch version for DeviceMesh to 2.4.0
- introduce function decorator that raises an exception when unavailable functions are called.
- adds a little more protection in the tests to differentiate,
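
A hedged sketch of this kind of gating (the helper and decorator names are mine, and the `packaging` dependency is an assumption):

```python
import functools
import torch
from packaging import version

def meets_version(minimum: str) -> bool:
    """Compare the installed torch version against a minimum."""
    return version.parse(torch.__version__.split("+")[0]) >= version.parse(minimum)

def require_torch(minimum: str):
    """Raise only when the guarded function is actually called on old torch."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if not meets_version(minimum):
                raise RuntimeError(
                    f"{fn.__name__} requires torch >= {minimum}, "
                    f"found {torch.__version__}"
                )
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@require_torch("2.5.9")  # 2.5.9 rather than 2.6.0, to accept 2.6.0a alpha builds
def make_shard_tensor(*args, **kwargs):  # hypothetical guarded entry point
    ...
```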

---------

* 1.0.1 rc rebase (#829)

* Comment warnings setting (#830)

* Update pyproject.toml links (#832)

Replace `modulus` links with updated `physicsnemo` links.

Co-authored-by: Nicholas Geneva <[email protected]>

* Update README.md reference link (#821)

* Update README.md

---------

Co-authored-by: Nicholas Geneva <[email protected]>

* Update README.md (#833)

* Update README.md

* Dockerfile Fixes (#835)

* Update dockerfile

* Update dockerfile

* Order swap

* update

* Swap again

* add FORCE_CUDA flags to torch-scatter and torch-cluster source installs, install makani and fignet dependencies explicitly

---------

Co-authored-by: Kaustubh Tangsali <[email protected]>

* MSC Checkpointing Changes (#789)

* Working changes to be cleaned up.

* Rename msc_config.yaml

* Fixed pytorch test issue by removing MSC Cache

* Updated project dependencies

* Find MSC config using absolute path.

* Re-added cuda test parameter.

* Add test to read from public S3 bucket using MSC.

* Revert save_checkpoint_freq value.

* Remove temporary printing

* Remove unnecessary dependency

* Switched to use consistent mechanism for detecting msc URIs

* Moved fsspec.filesystem logic into filesystem.py

* Change to cache for non-file protocols when reading non-modulus models.

* Moved code to generate checkpoint directory.

* Added get_checkpoint_dir import

* Address review feedback.

* Changes from code review.

* Addressed file test issue from review.

* Fix to file existence check.

* Fix merge conflicts due to project name change.

* Updated CHANGELOG.

* Added Multi-Storage Client to allow checkpointing to/from Object Storage

Signed-off-by: Chris Hawes <[email protected]>

* Addressed issues identified by pre-commit.

* Update filesystem.py

* Update __init__.py

* Update Dockerfile

---------

Signed-off-by: Chris Hawes <[email protected]>
Co-authored-by: Nicholas Geneva <[email protected]>

* Fixes DeprecationWarning introduced in setuptools>=77 (#837)

* Fixes DeprecationWarning introduced in setuptools>=77

* setuptools does not allow redundant license specification in project.license and project.classifiers

* Cordiff usability and performance enhancements for custom dataset training (#790)

* Add recent checkpoints option, adjust configs

* Doc for deterministic_sampler

* Typo fix

* Bugfix and cleanup of corrdiff regression loss and UNet

* Minor fix in docstrings

* Bugfix + doc for corrdiff regression CE loss

* Refactor corrdiff configs for custom dataset

* Bugfix in configs

* Added info in corrdiff docs for custom training

* Minor change in corrdiff config

* bring back base config file removed by mistake

* Added config for generation on custom dataset

* Forgot some config files

* Fixed overlap pixel in custom config based on discussion in PR #703

* Corrdiff fixes to enable non-square images and/or non-square patches. Needs testing.

* Fix small bug in config

* Removed argument redundancy in patching utilities + fixed height-width order

* Cleanup

* Added tests for rectangle images and patches

* Added wandb logging for corrdiff training

* Implements patching API. Refactors corrdiff train and generate to use it

* Corrdiff function to register new custom dataset

* Reorganize configs again

* Correction in configs: training duration is NOT in kilo images

* Readme re-write

* Updated CHANGELOG

* Fixed formatting

* Test fixes

* Typo fix

* Fixes on patching API

* Fixed patching bug and tests

* Simplifications in corrdiff diffusion step

* Forgot to propagate change to test for cordiff diffusion step

* Renamed patching API to explicit 2D

* Fixed shape in test

* Replace loops with fold/unfold patching for perf
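
For reference, loop-free patch extraction and reassembly with `unfold`/`fold` looks roughly like this (image size, patch size, and stride are illustrative, not the CorrDiff defaults):

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 3, 448, 448)          # (batch, channels, H, W)
patch, stride = 64, 56                    # overlapping patches

cols = F.unfold(x, kernel_size=patch, stride=stride)   # (B, C*patch*patch, L)
n_patches = cols.shape[-1]
patches = cols.transpose(1, 2).reshape(2, n_patches, 3, patch, patch)

# fold() inverts unfold(); overlapping pixels are summed, so divide by a
# count map (a fold of ones) to average the overlaps back out.
recon = F.fold(cols, output_size=(448, 448), kernel_size=patch, stride=stride)
counts = F.fold(torch.ones_like(cols), output_size=(448, 448),
                kernel_size=patch, stride=stride)
averaged = recon / counts
```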

* Added method to dynamically change number of patches in RandomPatching

* Adds safety checks for patch shapes in patching function. Fixes tests

* Fixes docs

* Forgot a fix in docs

* New embedding selection strategy in CorrDiff UNet models

* Updated CHANGELOG.md

* Fixed tests for SongUNet position embeddings

* More robust tests for patching

* Fixed docs bug

* More bugfixes in doc tests

* Some renaming

Signed-off-by: Charlelie Laurent <[email protected]>

* Bugfixes, cleanup, docstrings

Signed-off-by: Charlelie Laurent <[email protected]>

* Docstring improvement for UNet and EDMPrecondSR

Signed-off-by: Charlelie Laurent <[email protected]>

* Docs for InfiniteSampler

Signed-off-by: Charlelie Laurent <[email protected]>

* Corrected Readme info about training/generate from checkpoints

Signed-off-by: Charlelie Laurent <[email protected]>

* Bugfixes in generate scripts, cleanup debugging flags

Signed-off-by: Charlelie Laurent <[email protected]>

* Removed blank line from changelog

Signed-off-by: Charlelie Laurent <[email protected]>

* Fixes in CI tests

Signed-off-by: Charlelie Laurent <[email protected]>

* Forgot to commit one of the CI fixes

Signed-off-by: Charlelie Laurent <[email protected]>

* Fix example in doc

Signed-off-by: Charlelie Laurent <[email protected]>

---------

Signed-off-by: Charlelie Laurent <[email protected]>
Co-authored-by: Peter Harrington <[email protected]>

* Update from_checkpoint docs (#843)

* resolving merge conflicts

* fixing minor issues

* fixing conflicts

* Optimizations and efficiency improvements in the domino datapipe.  Highlights are:
- Alter numpy file format slightly to no longer require pickle.  TODO: needs a fallback if this fails.
  Removing the requirement on pickle allows slightly faster data loading with threading.
- Separate surface, volume, and joint preprocessing into stand alone functions.  This isn't super
  useful immediately but the end goal, if not subsampling, is to put the volume and surface pipelines
  in separate cuda streams to overlap them.
- Reverse the order of the kNN calculation in the surface preprocessing.  The kNN originally
  was finding all k neighbors of all points in the surface.  Now, if sampling, we find only
  neighbors of points that survive the sampling.  This is a 50x reduction in computational cost.
  (the 2% un-reduced cost comes from the unchanged need to build a search tree over the whole mesh)
- Rework and optimize several sampling functions: instead of creating an index for all points,
  randomizing it, and taking the front; now the functions will simply choose N_points at random.
  (This does not really help in the weighted sampling functions)
- Introduce a custom collation function for torch to bring cupy arrays to torch arrays without copy (see the sketch after this list).
- All other small operations have been ported to cupy, which gives further benefits.

There are still to-dos:
- validate this works without cupy
- make sure this works without sampling (even if slow)
- Fix the need to jump to CPU for sdf and area_weighted_sampling.
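
The zero-copy collation noted in the list above can be sketched with DLPack; this is an assumption-level illustration, not the datapipe's actual collate function:

```python
import cupy as cp
import torch
from torch.utils.data import default_collate

def collate_cupy(batch):
    """Collate dict-style samples; cupy arrays stay on the GPU via DLPack."""
    out = {}
    for key in batch[0]:
        items = [sample[key] for sample in batch]
        if isinstance(items[0], cp.ndarray):
            stacked = cp.stack(items)              # still a device array
            out[key] = torch.from_dlpack(stacked)  # zero-copy torch view
        else:
            out[key] = default_collate(items)
    return out
```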

* Remove obsolete and unused dataclasses - it's a flat config hierarchy, these are vestigial.

* This commit enables reading the old-style pickled files by default.  We can switch to
threaded reading when the preprocessing is ready and available.

* Provide more robust reading of pickled files.

Ensure compute_scale_factors works even with GPU preprocessing.

* fixing bugs

* Fix several small bugs: the dataloader sometimes implicitly uses cupy instead of
selecting based on config.

* Fix issue if using CPU data loading.

* Ensure all gpu preprocessing is directed to the proper device

* Ensure that the dataloader doesn't waste GPU memory.  Previously, loading
in a context on device != 0 would allocate memory on device 0.

* Enable zarr readers.  Use file path to toggle which type of file to read.

* Improve logging and track memory leak.  Enable zarr.

* Add GPU monitoring to the training script, and recreate the knn class each iteration.  Otherwise, it leads to a memory leak.

* Update README.md

* Enforce the determinism request in the domino pipeline.

* This commit makes an improvement to the zarr reading: reads are now _chunk_aligned_
and read directly into a numpy buffer.  This enables better multithreading since each
thread only interfaces with one zarr chunk.

* Put ALL zarr chunk reads into futures and thread the IO.
Limited by the threadpool and IO speed.  It'd be nice to
stream right into pinned memory, but the data reads seem to be
too large for that pool.  TBD.
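
A simplified sketch of the futures-based, chunk-aligned reads (assuming chunking along the first axis only; not the actual reader code):

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np
import zarr

def threaded_chunk_read(arr: zarr.Array, max_workers: int = 8) -> np.ndarray:
    """Read a zarr array with one future per chunk along the first axis."""
    out = np.empty(arr.shape, dtype=arr.dtype)
    chunk = arr.chunks[0]

    def read_one(start: int) -> None:
        stop = min(start + chunk, arr.shape[0])
        out[start:stop] = arr[start:stop]  # each thread touches exactly one chunk

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(read_one, s) for s in range(0, arr.shape[0], chunk)]
        for f in futures:
            f.result()  # surface any read errors
    return out
```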

* Introduce a Sharded data pipeline for DoMINO.  This class is constructed from the standard
pipeline, with several extra pieces of information:
- the domain mesh over which the data pipeline is sharded
- Whether to shard point-like outputs (volume fields, surface fields, etc)
- Whether to shard grid-like outputs

This commit also includes some minor refinements to the standard pipeline
to make bootstrapping a sharded version functional.

* Domino perf (#848)

* Disable the `length` variables in BallQuery.  They are unused, but still allocate memory
and are saved for the backwards pass.  It's not necessary since it's never used, as
far as I can tell.

* Optimizations and efficiency improvements in the domino datapipe.  Highlights are:
- Alter numpy file format slightly to no longer require pickle.  TODO: needs a fallback if this fails.
  Removing the requirement on pickle allows slightly faster data loading with threading.
- Separate surface, volume, and joint preprocessing into stand alone functions.  This isn't super
  useful immediately but the end goal, if not subsampling, is to put the volume and surface pipelines
  in separate cuda streams to overlap them.
- Reverse the order of the kNN calculation in the surface preprocessing.  The kNN originally
  was finding all k neighbors of all points in the surface.  Now, if sampling, we find only
  neighbors of points that survive the sampling.  This is a 50x reduction in computational cost.
  (the 2% un-reduced cost comes from the unchanged need to build a search tree over the whole mesh)
- Rework and optimize several sampling functions: instead of creating an index for all points,
  randomizing it, and taking the front; now the functions will simply choose N_points at random.
  (This does not really help in the weighted sampling functions)
- Introduce a custom collation function for torch to bring cupy arrays to torch arrays without copy.
- All other small operations have been ported to cupy, which gives further benefits.

There are still to-dos:
- validate this works without cupy
- make sure this works without sampling (even if slow)
- Fix the need to jump to CPU for sdf and area_weighted_sampling.

* Remove obsolete and unused dataclasses - it's a flat config hierarchy, these are vestigial.

* This commit enables reading the old-style pickled files by default.  We can switch to
threaded reading when the preprocessing is ready and available.

* Provide more robust reading of pickled files.

Ensure compute_scale_factors works even with GPU preprocessing.

* Fix several small bugs: the dataloader sometimes implicitly uses cupy instead of
selecting based on config.

* Fix issue if using CPU data loading.

* Ensure all gpu preprocessing is directed to the proper device

* Ensure that the dataloader doesn't waste GPU memory.  Previously, loading
in a context on device != 0 would allocate memory on device 0.

* Enable zarr readers.  Use file path to toggle which type of file to read.

* Improve logging and track memory leak.  Enable zarr.

* Add GPU monitoring to the training script, and recreate the knn class each iteration.  Otherwise, it leads to a memory leak.

* Enforce the determinism request in the domino pipeline.

* This commit makes an improvement to the zarr reading: reads are now _chunk_aligned_
and read directly into a numpy buffer.  This enables better multithreading since each
thread only interfaces with one zarr chunk.

* Put ALL zarr chunk reads into futures and thread the IO.
Limited by the threadpool and IO speed.  It'd be nice to
stream right into pinned memory, but the data reads seem to be
too large for that pool.  TBD.

* Introduce a Sharded data pipeline for DoMINO.  This class is constructed from the standard
pipeline, with several extra pieces of information:
- the domain mesh over which the data pipeline is sharded
- Whether to shard point-like outputs (volume fields, surface fields, etc)
- Whether to shard grid-like outputs

This commit also includes some minor refinements to the standard pipeline
to make bootstrapping a sharded version functional.

* bug fix - validation step commented out

* Update ball query module to call the functional interface to leverage
shardtensor's tools.

Finish removing length variables

* minor fixes to train.py

* Fix CUPY float/int datatype casting. (#852)

* This commit addresses an issue where the mesh indexes were being improperly
converted to float32 at some point.  This enables the preprocessing workflow
to stay on the GPU for this section, if the data is on GPU.

* Update domino_datapipe.py

Fix bug in min/max joining.

* This commit creates alternative versions of the domino loss functions that are
significantly simpler and shorter, while producing numerically consistent
results.

Original functions are maintained in this commit and the training script
compares individual loss components as well as total loss.

* Remove older loss functions and consolidate script.

* Merge loss function updates.

* Update model.py (#855)

Replace torch.expand + torch.gather with torch.index_select.  This saves a huge amount of memory and is computationally even a little faster.

On 80GB, number of points can be increased from about 6000 to 60000.
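
The replacement is easy to see in isolation (shapes here are illustrative, not the model's):

```python
import torch

features = torch.randn(100_000, 64)              # (num_points, channels)
idx = torch.randint(0, features.shape[0], (60_000,))

# gather formulation: the row index has to be repeated across every channel
gathered = torch.gather(features, 0, idx.unsqueeze(1).expand(-1, features.shape[1]))

# index_select reads the same rows from a plain 1-D index
selected = torch.index_select(features, 0, idx)
assert torch.equal(gathered, selected)
```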

* modifying train.py

* minor fixes

* Domino Loss Functions (#853)

* This commit creates alternative versions of the domino loss functions that are
significantly simpler and shorter, while producing numerically consistent
results.

Original functions are maintained in this commit and the training script
compares individual loss components as well as total loss.

* Remove older loss functions and consolidate script.

* fourier features to model params and cleanup

* modifying train.py

* minor fixes

* merging changes in train.py

* Merges `main` branch back into `domino` branch (#856)

* Stormcast Customization (#799)

* Add common dataloader interface

* Training script runs with refactored dataloader

* More trainer refactoring

* Refactor inference

* Add support for gradient accumulation

* Add support for AMP 16-bit training

* Align training parameters with StormCast paper

* Add comments to inference.py

* Add lite configs

* Add lite configs

* Small bug fixes

* Add support for compiling model

* Validation fixes

* Refactor checkpoint loading at startup

* Support wandb offline mode

* Fix regression_model_forward

* Update CHANGELOG.md

* Update README.md for DOMINO (#846)

* Update ERA5 download example (#845)

* Update era5 download example

* Update changelog

* Pytest speedup (#849)

* add code to measure time spent in pytest

* speed up datapipe tests

* fix cleanup of dist vars (was causing slowdown in test_capture.py)

* speed up model tests

* bring back some parameterizations, reduced cpu tests

---------

Co-authored-by: Jussi Leinonen <[email protected]>
Co-authored-by: Kaustubh Tangsali <[email protected]>
Co-authored-by: Peter Harrington <[email protected]>

* DoMINO Model Refactor (#840)

* Remove unused imports, add typing on functions and DoMINO constructor

* Adds type hints

* Removes both unused imports and unused positional encoding function (never called).

* Removes calculate_gradient() function for sdf, which can be replaced with the one-liner torch.gradient(). Note that this is not exactly numerically identical, as:

 a) the old method did not divide the central difference by 2 to account for the doubled dx on a central difference, while this does.

 b) the old method replaced the ends with zero gradients, while the new method replaces them with one-sided first-order finite differences.
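
A one-liner equivalent, under an assumed grid shape and spacing:

```python
import torch

sdf = torch.randn(32, 32, 32)    # illustrative SDF grid
dx = [0.1, 0.1, 0.1]             # assumed grid spacing per axis

# Central differences in the interior, one-sided first-order differences at the
# boundaries (edge_order=1 is the default), matching note b) above.
grad_x, grad_y, grad_z = torch.gradient(sdf, spacing=dx, dim=[0, 1, 2])
```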

* Adds a docstring to scale_sdf()

* Replaces super(DoMINO, self) with super(), which is better practice in Python 3. (The former breaks upon inheritance, while the latter does not)

* Ruff formatting pass.

* Removes binarize_sdf(), which can be performed as a one-liner to enhance readability.

* type hints

* Adds documentation and readability refactors on ball_query warp kernel.

* Adds differentiability note

* Makes wp.tid() naming consistent across warp kernels, for readability.

* Adds type hinting on backwards pass.

* Adds docs and type hints on BallQueryLayer

* Conciseness

* Adds docs for BQWarp

* Adds forward pass docs for GeoProcessor

* Adds docs for BQWarp

* Adds docs for GeoConvOut

* Adds docs for GeoProcessor

* Functional change: removes padded_value=-10 default, which seems like dead code.

* Refactors layers for readability, and fixes an important bug: in the 3rd level of the downsampling, conv2 was accidentally used twice, and conv3 was never used (in the batch_norm branch).

* Adds docs

* Ruff format pass

* Ruff check fixes

* Fixes black formatting

* Removes geometry_encoder(), which is never used (other calls already directly use self.geo_rep, so this is dead code)

* Fixes mutable default arguments

* Adds ValueError for a potential silent error

* Fixes typos

* Stormcast Customization (#799)

* Add common dataloader interface

* Training script runs with refactored dataloader

* More trainer refactoring

* Refactor inference

* Add support for gradient accumulation

* Add support for AMP 16-bit training

* Align training parameters with StormCast paper

* Add comments to inference.py

* Add lite configs

* Add lite configs

* Small bug fixes

* Add support for compiling model

* Validation fixes

* Refactor checkpoint loading at startup

* Support wandb offline mode

* Fix regression_model_forward

* Update CHANGELOG.md

* Update README.md for DOMINO (#846)

* Update ERA5 download example (#845)

* Update era5 download example

* Update changelog

* Pytest speedup (#849)

* add code to measure time spent in pytest

* speed up datapipe tests

* fix cleanup of dist vars (was causing slowdown in test_capture.py)

* speed up model tests

* bring back some parameterizations, reduced cpu tests

* black formatting pass

* Format imports

* black formatting pass

* Fixes https://github.com/NVIDIA/physicsnemo/pull/840#discussion_r2060715740

* markdownlint fix

* Remove unused input_features parameter from BQWarp instantiation in GeometryRep and DoMINO classes.

* Remove batch normalization layers and non-configurable flag from GeoProcessor class in model.py. Related to discussion here: https://github.com/NVIDIA/physicsnemo/pull/840#discussion_r2060692036

* Fixes a bug where negative areas were causing NaNs in demo

* formatting

---------

Co-authored-by: Jussi Leinonen <[email protected]>
Co-authored-by: Kaustubh Tangsali <[email protected]>
Co-authored-by: Peter Harrington <[email protected]>

* This commit addresses a bug in shard tensor: torch.tensor_split and torch.chunk do not split tensors the same way (see the example below). When redistribute is called without a "plan" for chunking, it needs to use torch.chunk so that the shapes are what DTensor and the size validation expect.

This also changes the checking behavior: there is now a simple check that the local shape matches the spec's stored shape along the first mesh dimension, and the code crashes if it fails. Previously, it was possible to fail on only some ranks, leaving the collectives disordered across ranks.
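A minimal, self-contained example of the splitting mismatch referenced above (independent of ShardTensor itself):

```python
import torch

t = torch.arange(10)

# torch.chunk uses ceil(10 / 3) = 4 elements per chunk, so the tail chunk is short
print([c.numel() for c in torch.chunk(t, 3)])         # [4, 4, 2]

# torch.tensor_split spreads the remainder over the leading chunks instead
print([c.numel() for c in torch.tensor_split(t, 3)])  # [4, 3, 3]
```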

* Ensure the backward gradient computations use consistent types.

* In ring calculations, the global rank was being used to compute the source and destination; however, this didn't properly account for non-global meshes. This commit uses local ranks to determine the source/destination IDs, converting them to global indexing only when sending the messages (sketched below).
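A hedged sketch of that local-to-global translation; the helper name and the `group_ranks` mapping are assumptions for illustration, not the actual ShardTensor ring code:

```python
def ring_neighbors(local_rank: int, group_ranks: list[int]) -> tuple[int, int]:
    """Compute ring neighbours by local position, then map to global ranks for messaging.

    group_ranks maps local rank -> global rank along one mesh dimension.
    """
    size = len(group_ranks)
    src_local = (local_rank - 1) % size
    dst_local = (local_rank + 1) % size
    return group_ranks[src_local], group_ranks[dst_local]


# Example: a 4-way mesh dimension living on global ranks [4, 5, 6, 7]
print(ring_neighbors(0, [4, 5, 6, 7]))  # (7, 5): receive from global 7, send to global 5
```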

* Implement sharded version of torch's index_select.

* This commit enables the following pieces:
- torch select and index_select now intercept at the torch level instead of at aten. This ensures proper backward-pass gradient sharding (a generic interception sketch is shown below).
- Mean and Sum reductions now completely ignore the DTensor implementation on ShardTensors. The motivation is that the backward pass in DTensor will not shard gradients properly; it's not an issue for DTensor itself, but it is problematic for domain parallelism.
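A toy sketch of torch-level interception via a tensor subclass; the class name and the intercepted behavior are illustrative assumptions, not the actual ShardTensor implementation:

```python
import torch

class ToyShardedTensor(torch.Tensor):
    """Minimal subclass showing interception at the torch level (before aten decomposition)."""

    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        if func is torch.index_select:
            # A real sharded tensor would dispatch to a sharding-aware autograd path here,
            # so the backward pass produces correctly sharded gradients.
            print("intercepted torch.index_select at the torch level")
        return super().__torch_function__(func, types, args, kwargs)


x = torch.arange(12.0).reshape(3, 4).as_subclass(ToyShardedTensor)
y = torch.index_select(x, 0, torch.tensor([0, 2]))  # prints the message, then runs normally
```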

* This commit handles some of the final updates required to enable full sharding in DoMINO.
- shard_tensor will now shard gradients in the backward pass when converting from torch.Tensor.
- unpooling patches was updated to calculate output shapes with no communication overhead.
- point cloud ops now raise an exception if the backward call in RingBallQuery is reached;
  it's not implemented correctly yet, but it is also not used yet.

* Add profiling hooks to the domino model.

* updating model and fixing bug in datapipe

* This is the last commit enabling sharding.  The model is fully compatible.

Look to train_sharded.py to see the changes for domino domain parallelism.

* Update the domino readme to include information on domain parallelism.

* Explicit Warp device allocation for SDF and Ball Query (#876)

* Explicit warp device management

* explicit warp device management in SDF

* Update sdf.py

* Update sdf.py

* Update sdf.py

* Update ball_query.py

* Update sdf.py

* Update CHANGELOG.md

* Update sdf.py

* A few fixes for the domino pipeline. (#863)

- Initialize the distributed manager if it isn't already.
- For partial datasets (surface only, volume only), don't move "None"
  objects to cupy.
- When sampling/shuffling, if the requested number of points is too high,
  don't error. Instead, shuffle and rely on padding (sketched below).
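One plausible reading of the shuffle-and-pad behavior, written as a hypothetical helper (the names and the zero-padding choice are assumptions, not the actual datapipe code):

```python
import numpy as np

def shuffle_and_pad(points: np.ndarray, n_requested: int) -> np.ndarray:
    """Return n_requested rows: a shuffled subsample, zero-padded if too few points exist."""
    order = np.random.permutation(len(points))
    if len(points) >= n_requested:
        return points[order[:n_requested]]
    out = np.zeros((n_requested,) + points.shape[1:], dtype=points.dtype)
    out[: len(points)] = points[order]
    return out
```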

* Add first draft of domain parallelism detailed tutorial.

* Update two pieces of shard tensor:
- Enforce consistent naming for sharding shapes (sharding_sizes is no longer the correct name)
- Remove the wrapt interface for conv wrappers; go directly through __torch_function__

* Add annotations and docstrings to sharded reduction operators

* Domino merge from `main` (#888)

* Stormcast Customization (#799)

* Add common dataloader interface

* Training script runs with refactored dataloader

* More trainer refactoring

* Refactor inference

* Add support for gradient accumulation

* Add support for AMP 16-bit training

* Align training parameters with StormCast paper

* Add comments to inference.py

* Add lite configs

* Add lite configs

* Small bug fixes

* Add support for compiling model

* Validation fixes

* Refactor checkpoint loading at startup

* Support wandb offline mode

* Fix regression_model_forward

* Update CHANGELOG.md

* Update README.md for DOMINO (#846)

* Update ERA5 download example (#845)

* Update era5 download example

* Update changelog

* Pytest speedup (#849)

* add code to measure time spent in pytest

* speed up datapipe tests

* fix cleanup of dist vars (was causing slowdown in test_capture.py)

* speed up model tests

* bring back some parameterizations, reduced cpu tests

* Add CELU activation function (#851)

* refactor: updating naming of a few files (modulus -> physicsnemo) (#850)

Co-authored-by: Oliver Hennigh <[email protected]>

* Various Corrdiff optimizations for drastic increase of training efficiency (#809)

* multi-GPU training support for corrdiff optimization

* enable mixed precision for val

* clean codebase for opt

* add amp_mode aware model architecture

* add None checking for params

* revise datatype casting schema

* Add test cases for corrdiff optimizations

Signed-off-by: Neal Pan <[email protected]>

* revised from_checkpoint, update tests and CHANGELOG

Signed-off-by: jialusui1102 <[email protected]>

* Lint and format code properly

Signed-off-by: Neal Pan <[email protected]>

* add multi-gpu optimization

* rebase changes and update tests and configs

Signed-off-by: jialusui1102 <[email protected]>

* merge ResidualLoss and refactored layer and Unet init based on PR review

Signed-off-by: jialusui1102 <[email protected]>

* Update layers.py with robust apex import

* address incompatibility between dynamo and patching, retaining the same optimization performance with torch.compile

Signed-off-by: jialusui1102 <[email protected]>

* update tests

Signed-off-by: jialusui1102 <[email protected]>

* update changelog

Signed-off-by: jialusui1102 <[email protected]>

* initialize global_index directly on device

Signed-off-by: jialusui1102 <[email protected]>
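The pattern behind the "initialize global_index directly on device" change above, sketched generically (the tensor name and size here are illustrative, not the exact SongUNetPosEmbd code):

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
num_points = 1024  # hypothetical size

# Before: built on CPU, then copied -> an extra host-to-device memcpy per call
# global_index = torch.arange(num_points).to(device)

# After: allocated directly on the target device, no HtoD copy
global_index = torch.arange(num_points, device=device)
```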

* formatting

Signed-off-by: jialusui1102 <[email protected]>

---------

Signed-off-by: Neal Pan <[email protected]>
Signed-off-by: jialusui1102 <[email protected]>
Co-authored-by: Alicia Sui <[email protected]>
Co-authored-by: jialusui1102 <[email protected]>
Co-authored-by: Charlelie Laurent <[email protected]>

* Catch improper use of patch gradient accumulation (#868)

* Update train.py to catch improper use of patch grad acc

* Update train.py

* Update train.py

* Fixes compile of regression model in train.py

* Removed unused imports

Signed-off-by: Charlelie Laurent <[email protected]>

* Changed grad patch accumulation logic

Signed-off-by: Charlelie Laurent <[email protected]>
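A hypothetical sketch of the kind of guard that "catch improper use of patch grad acc" above could refer to; the parameter names and the divisibility rule are assumptions for illustration, not the actual train.py logic:

```python
def check_patch_grad_accumulation(patches_per_sample: int, patch_iterations: int) -> int:
    """Reject configurations where patches cannot be split evenly across accumulation steps."""
    if patch_iterations <= 0:
        raise ValueError("patch_iterations must be a positive integer.")
    if patches_per_sample % patch_iterations != 0:
        raise ValueError(
            f"patches_per_sample ({patches_per_sample}) must be divisible by "
            f"patch_iterations ({patch_iterations}) for patch gradient accumulation."
        )
    return patches_per_sample // patch_iterations
```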

---------

Signed-off-by: Charlelie Laurent <[email protected]>

* This commit fixes two minor bugs in the physicsnemo profiling tools (#862)

- If line_profiler isn't available, it sometimes broke due to a missing check.
- If the torch profiler is used but the code exits before profiling, it will crash.

* Adding abokov-nv to authorized users to trigger blossom-ci.yml (#867)

* Fixes indexing issues and CPU memory consumption in dataloader (#879)

* Add workflow to automatically sync changes from nvidia/modulus into main branch

* add WeightedOceanMSE to criterion

* add optional gaussian noise to inputs and coupled variables during training - should improve coupled stability

* add random seed - still need to test

* remove datatransformer code - shouldn't be part of this PR

* move logging

* Removed blossom-ci workflow from modulus-uw fork, updated automatic sync

* Fix the training and inference problem in nvidia modulus

Signed-off-by: root <[email protected]>

* Fix indexing in constant coupler

* add 'Multi_SymmetricConvNeXtBlock'

* Replace 'n_layers' with 'n_conv_blocks' for clarity

* Fix indexing in constant coupler

Signed-off-by: root <[email protected]>

* Fix indexing in constant coupler

Signed-off-by: root <[email protected]>

* change back to 'n_layers' to match the old models

* enforce precedence of upstream modulus changes when auto syncing.

* set scaling for mean: 0, std: 1 where no change is needed

* fix memory leak in coupled timeseries

* Add workflow to automatically sync changes from nvidia/modulus into main branch

* Fix the training and inference problem in nvidia modulus

Signed-off-by: root <[email protected]>

* Fix indexing in constant coupler

Signed-off-by: root <[email protected]>

* Fix indexing in constant coupler

Signed-off-by: root <[email protected]>

* Removed blossom-ci workflow from modulus-uw fork, updated automatic sync

* enforce precedence of upstream modulus changes when auto syncing.

* set scaling for mean: 0, std: 1 where no change is needed

* add 'Multi_SymmetricConvNeXtBlock'

* Replace 'n_layers' with 'n_conv_blocks' for clarity

* change back to 'n_layers' to match the old models

* fix memory leak in coupled timeseries

* add coupler fixes, var and time selection

* Fix for ordering on coupler

* batch size fix in coupler

* broken workflow cleanup

* cleanup for upstream merge (#20)

---------

Signed-off-by: root <[email protected]>
Co-authored-by: nathanielcresswellclay <[email protected]>
Co-authored-by: zacespinosa <[email protected]>
Co-authored-by: nathanielcresswellclay <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Bowen Liu <[email protected]>
Co-authored-by: Bowen Liu <[email protected]>
Co-authored-by: Yair Cohen <[email protected]>

* CorrDiff: inference bugfixes, cleanup, and documentation improvements (#882)

* Disabled cuda profiler for cpu runs

Signed-off-by: Charlelie Laurent <[email protected]>

* Added __init__ to avoid dataset module collision

Signed-off-by: Charlelie Laurent <[email protected]>

* Disabled torch emit_nvtx for cpu runs. Renamed 'test_train_split' to 'validation'

Signed-off-by: Charlelie Laurent <[email protected]>

* Fixed typo in 'set_patch_num'

Signed-off-by: Charlelie Laurent <[email protected]>

* More profiling stats disabled for CPU runs

Signed-off-by: Charlelie Laurent <[email protected]>

* Removed duplicate code in ResidualLoss

Signed-off-by: Charlelie Laurent <[email protected]>

* Disabled AMP in inference

Signed-off-by: Charlelie Laurent <[email protected]>

* Fixed f-strings in train script

Signed-off-by: Charlelie Laurent <[email protected]>

* Added details about validation and early-stopping in readme

Signed-off-by: Charlelie Laurent <[email protected]>

---------

Signed-off-by: Charlelie Laurent <[email protected]>

* Stormcast customization conditions (#880)

* Add configurable model inputs

* Ensure pure diffusion works

* Update docstrings, error handling

* Update StormCast README.md and docstrings

* Minor revisions to StormCast README.md

* Fix typo in StormCast README.md

* Making the unit tests' nfs-data-dir configurable (#866)

* Update msc test with import_or_skip decorator (#884)

* update msc test with import_or_skip decorator

* linting

* update package name

---------

Co-authored-by: root <[email protected]>

---------

Signed-off-by: Neal Pan <[email protected]>
Signed-off-by: jialusui1102 <[email protected]>
Signed-off-by: Charlelie Laurent <[email protected]>
Signed-off-by: root <[email protected]>
Co-authored-by: Jussi Leinonen <[email protected]>
Co-authored-by: Kaustubh Tangsali <[email protected]>
Co-authored-by: Peter Harrington <[email protected]>
Co-authored-by: Yang-yang Tan <[email protected]>
Co-authored-by: Carmelo Gonzales <[email protected]>
Co-authored-by: Oliver Hennigh <[email protected]>
Co-authored-by: nekobytz <[email protected]>
Co-authored-by: Alicia Sui <[email protected]>
Co-authored-by: jialusui1102 <[email protected]>
Co-authored-by: Charlelie Laurent <[email protected]>
Co-authored-by: abokov-nv <[email protected]>
Co-authored-by: David Pruitt <[email protected]>
Co-authored-by: nathanielcresswellclay <[email protected]>
Co-authored-by: zacespinosa <[email protected]>
Co-authored-by: nathanielcresswellclay <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Bowen Liu <[email protected]>
Co-authored-by: Bowen Liu <[email protected]>
Co-authored-by: Yair Cohen <[email protected]>
Co-authored-by: root <[email protected]>

* Remove `wrapt` usage from all but one patch (and that one is coming next).

Some minor bug fixes found in regressions.

* Add tests to verify that shard tensor operations do not trigger on torch.Tensor objects

* Automatically enable all shard tensor execution paths, now that they
are implemented with __torch_function__ and __torch_dispatch__ and cannot
run on standard torch.Tensors.

* Add first draft of tutorial for extending shard tensor.

* Update tests to accommodate the new domino model. Minor tweaks to domino (#889)

docstring to make clear that sizes > 0.

* fixing minor bugs

* Exclusively fix linting errors. (#895)

* Domino datapipe test (#896)

* Fix ruff error

* Add test for domino datapipe

* Fix ruff error.

* Remove numpy conversion since sdf now returns a numpy array directly

* Enable cupy usage in computing scaling factors.

* Add a test on consecutive reductions, which was failing when doing a full reduction to a scalar while one of the tensors was replicated.

* Update training scripts slightly: move nvinit from pynvml to after distributed init

* Add more context to the tutorial on implementing custom domain parallel layers

* Ensure sharded ops are mapped, and update domain parallel tutorial

* Slight reorganization of scripts in the tutorials to make testing and maintenance easier.

* Fix typos and links and minor details in domain parallel tutorials.

* Remove file that should not be present

* Remove file that should not be present (2)

* Update ahmed_body.ipynb

Undo breaking change.

* Update profiling.rst

Put whitespace back.

* Update test.py

Undo changed AIR_DENSITY...

* Ensure download links are not messed up ...

* Update download_dataset.sh

Fix file name

* Fix typo: primative fixed to primitive.

* Increase verbosity of  comments in tutorial scripts

* Update examples/cfd/external_aerodynamics/domino/README.md

Co-authored-by: Peter Sharpe <[email protected]>

* Fix bug in shard tensor dispatch.

Update tests to ensure they are passing with latest changes.
(Mostly, at 8 gpus making sure there is enough data)

* Ensure domino returns properly after removing comparison tools.

* Fix failing test in CI...hopefully.

* Resolve comments from PR Review.

* Fix broken import in tests

* Missing multigpu marker

* Fix broken imports for shard tensor dispatch bindings

* tensors need to be contiguous for NCCL
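A generic illustration of why contiguity matters for NCCL collectives; the wrapper below is an assumption for demonstration (and requires an initialized process group), not the actual call site in the code:

```python
import torch
import torch.distributed as dist

def all_gather_contiguous(t: torch.Tensor, world_size: int) -> list[torch.Tensor]:
    """NCCL collectives expect contiguous buffers; transposed or sliced views may not be."""
    t = t.contiguous()  # no-op if t is already contiguous
    out = [torch.empty_like(t) for _ in range(world_size)]
    dist.all_gather(out, t)
    return out
```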

---------

Signed-off-by: root <[email protected]>
Signed-off-by: Chris Hawes <[email protected]>
Signed-off-by: Charlelie Laurent <[email protected]>
Signed-off-by: Neal Pan <[email protected]>
Signed-off-by: jialusui1102 <[email protected]>
Co-authored-by: Rishi Ranade <[email protected]>
Co-authored-by: nvssh nssswitch user account <[email protected]>
Co-authored-by: Kaustubh Tangsali <[email protected]>
Co-authored-by: Michael Mara <[email protected]>
Co-authored-by: Derek Lai <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Ikko Eltociear Ashimine <[email protected]>
Co-authored-by: Nicholas Geneva <[email protected]>
Co-authored-by: ram-cherukuri <[email protected]>
Co-authored-by: David Pruitt <[email protected]>
Co-authored-by: nathanielcresswellclay <[email protected]>
Co-authored-by: zacespinosa <[email protected]>
Co-authored-by: nathanielcresswellclay <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Bowen Liu <[email protected]>
Co-authored-by: Bowen Liu <[email protected]>
Co-authored-by: Yair Cohen <[email protected]>
Co-authored-by: Kaustubh Tangsali <[email protected]>
Co-authored-by: Alexey Kamenev <[email protected]>
Co-authored-by: Peter Sharpe <[email protected]>
Co-authored-by: Simon Byrne <[email protected]>
Co-authored-by: Alexey Kamenev <[email protected]>
Co-authored-by: Oliver Hennigh <[email protected]>
Co-authored-by: Peter Harrington <[email protected]>
Co-authored-by: Oliver Hennigh <[email protected]>
Co-authored-by: Oliver Hennigh <[email protected]>
Co-authored-by: WG <[email protected]>
Co-authored-by: chris-hawes <[email protected]>
Co-authored-by: Charlelie Laurent <[email protected]>
Co-authored-by: Peter Harrington <[email protected]>
Co-authored-by: RishikeshRanade <[email protected]>
Co-authored-by: Jussi Leinonen <[email protected]>
Co-authored-by: Mohammad Amin Nabian <[email protected]>
Co-authored-by: Yang-yang Tan <[email protected]>
Co-authored-by: Carmelo Gonzales <[email protected]>
Co-au…